Dynamic semantic models having multiple indices

ABSTRACT

Embodiments are directed towards dynamic semantic models having multiple indices. Source data may be provided to a network computer from at least one separate data source. A raw data graph may be generated from the source data such that the structure of the raw data graph may be based on the structure of the source data. Elements of the raw data graph may be mapped to a concept graph. Concept instances may be generated based on the concept graph, the raw data graph, and the source data. Model-identifiers (MIDs) that correspond to the concept instances may be generated to include at least a path in the concept graph. The MID values may be indexed into a plurality of indices based on a content-type of the data associated with the MIDs. In response to a query, a result set may be generated that includes result MIDs.

CROSS-REFERENCE TO RELATED APPLICATIONS

This Utility Patent Application is a Continuation of U.S. patent application Ser. No. 14/977,473 filed on Dec. 21, 2015, now U.S. Pat. No. 9,501,578 issued on Nov. 22, 2016, which is a Continuation of U.S. patent application Ser. No. 14/602,192 filed on Jan. 21, 2015, now U.S. Pat. No. 9,218,427 issued on Dec. 22, 2015, entitled “DYNAMIC SEMANTIC MODELS HAVING MULTIPLE INDICES,” the benefit of the filing dates of which are hereby claimed under 35 U.S.C. §120 and the contents of which are incorporated in entirety by reference.

TECHNICAL FIELD

This invention relates generally to information organization and data modeling and, more particularly, to the generation and use of semantic data models in search and analysis of data.

BACKGROUND

Organizations are generating and collecting an ever increasing amount of data. Data may be directly or indirectly generated from disparate parts of the organization, such as, consumer activity, manufacturing activity, customer service, quality assurance, or the like. For various reasons, it may be inconvenient for such organizations to effectively utilize their vast collections of data. In some cases the sheer quantity of data may make it difficult to effectively utilize the collected data to improve business practices. In other cases, the data collected by different parts of an organization may be stored in different formats, or stored in different locations. Further, employees within the organization may not be aware of the purpose or content of the various data collections stored throughout the organization. Accordingly, there may be many useful insights or correlations hidden in the collected data that are unnoticed or difficult to discover. Thus, it is with respect to these considerations and others that the invention has been made.

BRIEF DESCRIPTION OF THE DRAWINGS

Non-limiting and non-exhaustive embodiments of the present innovations are described with reference to the following drawings. In the drawings, like reference numerals refer to like parts throughout the various figures unless otherwise specified. For a better understanding of the described innovations, reference will be made to the following Description of Various Embodiments, which is to be read in association with the accompanying drawings, wherein:

FIG. 1 illustrates a system environment in which various embodiments may be implemented;

FIG. 2 shows a schematic embodiment of a client computer;

FIG. 3 illustrates a schematic embodiment of a network computer;

FIG. 4 shows a logical schematic of a portion of a semantic modeling system in accordance with at least one of the various embodiments;

FIGS. 5A and 5B show a logical schematic of a portion of an ingestion engine in accordance with at least one of the various embodiments;

FIG. 6 illustrates a logical representation of a portion of a semantic model in accordance with at least one of the various embodiments;

FIG. 7 illustrates a logical representation of a portion of a semantic model showing a referential relationship in accordance with at least one of the various embodiments;

FIG. 8 illustrates a logical representation of a portion of a semantic model showing a referential relationship in accordance with at least one of the various embodiments;

FIG. 9 illustrates a logical representation of a portion of the ingestion process for a system in accordance with at least one of the various embodiments;

FIG. 10 illustrates model-identifiers in accordance with at least one of the various embodiments;

FIG. 11 shows a portion of an index for indexing n-gram valued MIDs in accordance with at least one of the various embodiments;

FIG. 12 shows a portion of an index for indexing time-based valued MIDs in accordance with at least one of the various embodiments;

FIG. 13 shows a portion of an index for indexing geo-spatial valued MIDs in accordance with at least one of the various embodiments;

FIG. 14 illustrates a logical representation of the modeling process in accordance with at least one of the various embodiments;

FIG. 15 illustrates a logical representation of mapping a raw data graph to a concept model in accordance with at least one of the various embodiments;

FIG. 16 shows a portion of a forward index in accordance with at least one of the various embodiments;

FIG. 17 shows an overview flowchart for a process for generating dynamic semantic models having multiple indices in accordance with at least one of the various embodiments;

FIG. 18 shows an overview flowchart for a process for ingesting source data for a dynamic semantic model in accordance with at least one of the various embodiments;

FIG. 19 shows an overview flowchart for a process for performing pipelined actions to classify information for a dynamic semantic model in accordance with at least one of the various embodiments;

FIG. 20 shows an overview flowchart for a process for indexing information for a dynamic semantic model with multiple indices in accordance with at least one of the various embodiments;

FIG. 21 shows an overview flowchart for a process for responding to queries for information from a dynamic semantic model with multiple indices in accordance with at least one of the various embodiments;

FIG. 22 shows an overview for a process for mapping raw data graph elements to a concept graph in accordance with at least one of the various embodiments;

FIG. 23 shows an overview flowchart for a process for responding to queries for information from a dynamic semantic model with multiple indices in accordance with at least one of the various embodiments; and

FIG. 24 shows an overview flowchart for a process for performing non-pipelined actions to classify information for a dynamic semantic model in accordance with at least one of the various embodiments.

DESCRIPTION OF VARIOUS EMBODIMENTS

Various embodiments now will be described more fully hereinafter with reference to the accompanying drawings, which form a part hereof, and which show, by way of illustration, specific exemplary embodiments by which the invention may be practiced. The embodiments may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the embodiments to those skilled in the art. Among other things, the various embodiments may be methods, systems, media or devices. Accordingly, the various embodiments may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. The following detailed description is, therefore, not to be taken in a limiting sense.

Throughout the specification and claims, the following terms take the meanings explicitly associated herein, unless the context clearly dictates otherwise. The phrase “in one embodiment” as used herein does not necessarily refer to the same embodiment, though it may. Furthermore, the phrase “in another embodiment” as used herein does not necessarily refer to a different embodiment, although it may. Thus, as described below, various embodiments may be readily combined, without departing from the scope or spirit of the invention.

In addition, as used herein, the term “or” is an inclusive “or” operator, and is equivalent to the term “and/or,” unless the context clearly dictates otherwise. The term “based on” is not exclusive and allows for being based on additional factors not described, unless the context clearly dictates otherwise. In addition, throughout the specification, the meaning of “a,” “an,” and “the” include plural references. The meaning of “in” includes “in” and “on.”

For example embodiments, the following terms are also used herein according to the corresponding meaning, unless the context clearly dictates otherwise.

As used herein, “ontology” refers to a naming and definition of the types, properties, and interrelationships of the entities that exist for a particular domain. Ontologies are often defined for particular industries and/or industry activities. In some cases, an ontology for a domain may be employed as a standard describing a particular problem domain.

As used herein, “model identifier” refers to a data structure that is employed for identifying an entity in a concept model. Model identifiers (MIDs) comprise structural information as well as value information for an entity. The structural information defines how the entity fits within the structure of the concept model. The structural information may represent a path in a graph that corresponds to the structure of the model. MIDs may include one or more keys that determine which entity corresponds to a particular portion of the path. MIDs may also be indexed with a value for the particular instance. See FIG. 10 and its accompanying description for a detailed discussion of MIDs.
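As an illustration only, the definition above can be sketched as a small Python data structure. The class name `ModelIdentifier`, its field names, and the bracketed text rendering are assumptions made for this sketch; they are not the layout shown in FIG. 10.

```python
from dataclasses import dataclass
from typing import Any, Optional, Tuple


@dataclass(frozen=True)
class ModelIdentifier:
    """Hypothetical MID: structural path, per-segment keys, and a value."""

    # Structural information: a path through the concept graph.
    path: Tuple[str, ...]
    # Keys aligned with the leading path segments; each key selects which
    # entity a given portion of the path refers to (None means "no key").
    keys: Tuple[Optional[str], ...] = ()
    # Value information for the particular instance.
    value: Any = None

    def structural(self) -> str:
        """Render only the structural path, ignoring keys."""
        return "/".join(self.path)

    def qualified(self) -> str:
        """Render the path with each key attached to its segment."""
        parts = []
        for i, segment in enumerate(self.path):
            key = self.keys[i] if i < len(self.keys) else None
            parts.append(segment if key is None else f"{segment}[{key}]")
        return "/".join(parts)


mid = ModelIdentifier(
    path=("Movie", "Director", "name"),
    keys=("m42", "d7"),
    value="Jane Doe",
)
print(mid.qualified())
```

Keeping the structural path separate from the keys lets the same path be shared by every instance of a concept, while the keys distinguish one instance from another.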

As used herein, “concepts,” and “model concepts” refer to the ideas and values in a concept model. Fields from one or more nodes in a raw data graph may be mapped to properties that comprise one or more concepts.

As used herein “concept instance” refers to a particular instance of a concept in a concept model. For example, a concept model may include a concept such as Movies. A concept instance represents an individual movie.

As used herein the terms “concept graph,” and “concept model” refer to a graph where the nodes represent concepts and the edges represent relationships between the concepts. A concept model may be based on or represent one or more ontologies. The ontologies that define the model may be pre-defined, custom, and/or portions of existing ontologies, or combinations thereof. A concept model represents the structural organization and/or relationship of concepts that may be mapped to fields and/or nodes in a raw data graph.
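A minimal sketch of these definitions follows, assuming an explicit, hand-written mapping from raw field names to concept properties. The names `ConceptGraph`, `relate`, and `map_fields`, and the sample field names, are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


class ConceptGraph:
    """Hypothetical concept graph: nodes are concepts, edges are relationships."""

    def __init__(self):
        self.properties = defaultdict(set)  # concept name -> property names
        self.edges = defaultdict(set)       # concept name -> {(relationship, concept)}

    def add_concept(self, name, properties=()):
        self.properties[name].update(properties)

    def relate(self, source, relationship, target):
        self.edges[source].add((relationship, target))


def map_fields(raw_fields, field_to_property):
    """Group raw field values under the concept properties they map to.

    Fields without an entry in the mapping are left unmapped.
    """
    instances = defaultdict(dict)
    for field, value in raw_fields.items():
        if field in field_to_property:
            concept, prop = field_to_property[field]
            instances[concept][prop] = value
    return dict(instances)


graph = ConceptGraph()
graph.add_concept("Movie", ["title"])
graph.add_concept("Director", ["name"])
graph.relate("Movie", "directed_by", "Director")

mapped = map_fields(
    {"title": "Example Film", "dir_name": "Jane Doe", "row_id": 17},
    {"title": ("Movie", "title"), "dir_name": ("Director", "name")},
)
print(mapped)
```

In this sketch the mapping is supplied by hand; the specification describes classifier annotations on raw data graph elements as one source for such a mapping.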

As used herein “classifier,” and “classifiers” refer to components of the semantic modeling system used for processing source data (raw data) that is consumed by the semantic modeling system. In at least one of the various embodiments, classifiers that are registered with an ingestion engine are enabled to process the source data to discover, annotate, and/or extract information from the source data. In at least one of the various embodiments, classifiers that discover information in the source data may be arranged to annotate one or more nodes/fields in a raw data graph.

As used herein “data-type” refers to a type designation for the type of content of a raw field value. Typical examples may include, string, numeric, date-time, text, images, time-date, video, location (geo-spatial), or the like. A content-type may be included in the fields and/or nodes of a raw data graph.

As used herein the terms “classification,” and “classification type” refer to an indication of the type of information a raw data field may represent. A raw field value may be classified as being a type of information, such as, person first name, person last name, person name, business name, street address, email address, telephone number, date, time, postal codes, social security numbers, or the like. A classification type represents a higher-level concept than a data type.

As used herein the terms “query,” and “query string” refer to commands and/or sequences of commands that are used for querying, searching and/or retrieving data from a semantic modeling system. Queries generally produce a result or results depending on the form and structure of the particular query string. Query results may be sorted and grouped based on the structure and form of the query string. In at least one of the various embodiments, query strings may include operators and functions for calculating values based on the stored records, including functions that produce result sets that may include statistics and metrics about the data stored in a data repository. Structured Query Language (SQL) is a well-known query language often used to form queries for relational databases. However, the various embodiments are not limited to using SQL-like formatting for query strings. Accordingly, other well-known query languages and/or custom query languages may be employed consistent with what is claimed herein.

As used herein, “n-grams” refers to a contiguous set of alpha-numeric characters (grams) having a fixed number of members (n). N-grams can include words, numbers, combinations of letters and numbers, whitespace, combinations of words, or the like, or combination thereof. N-grams may be extracted from string/text values for generating index information. Accordingly, users may generate queries that include n-grams for locating records and/or information that may be associated with one or more of the n-grams included in the query.
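The following is a minimal sketch of n-gram extraction and lookup, assuming character-level n-grams with n = 3 and lowercase normalization. The function names and the choice of an in-memory inverted index are assumptions for illustration; they are not the index structure shown in FIG. 11.

```python
from collections import defaultdict


def char_ngrams(text, n=3):
    """Return every contiguous run of n characters in the lowercased text."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_ngram_index(records, n=3):
    """Map each n-gram to the set of record ids whose text contains it."""
    index = defaultdict(set)
    for record_id, text in records.items():
        for gram in char_ngrams(text, n):
            index[gram].add(record_id)
    return index


def lookup(index, query, n=3):
    """Return ids of records that contain every n-gram of the query.

    The result is a candidate set: a record may contain all of the query's
    n-grams without containing the query itself as a substring.
    """
    grams = char_ngrams(query, n)
    if not grams:
        return set()
    result = set(index.get(grams[0], set()))
    for gram in grams[1:]:
        result &= index.get(gram, set())
    return result


records = {"r1": "star wars", "r2": "start"}
index = build_ngram_index(records)
print(lookup(index, "wars"))
```

Queries shorter than n characters produce no n-grams and therefore return an empty set in this sketch; a real system would need a separate policy for short queries.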

The following briefly describes the various embodiments to provide a basic understanding of some aspects of the invention. This brief description is not intended as an extensive overview. It is not intended to identify key or critical elements, or to delineate or otherwise narrow the scope. Its purpose is merely to present some concepts in a simplified form as a prelude to the more detailed description that is presented later.

Briefly stated, embodiments are directed towards dynamic semantic models having multiple indices. In at least one of the various embodiments, source data may be provided to a network computer from at least one separate data source. A raw data graph may be generated from the source data such that the structure of the raw data graph may be based on the structure of the source data. In at least one of the various embodiments, generating the raw data graph may include providing the source data to one or more classifiers that may be identified on a classifier registration list and modifying one or more raw data graph elements based on actions performed by the one or more classifiers.

In at least one of the various embodiments, one or more elements of the raw data graph may be mapped to a concept graph. In at least one of the various embodiments, mapping the one or more elements of the raw data graph to a concept graph may include determining one or more raw data graph elements based on one or more annotations that classifiers may have added to the raw data graph elements. Further, in at least one of the various embodiments, concept instances may be generated based on the concept graph, the raw data graph, and the source data. In some embodiments, model-identifiers (MIDs) that correspond to the one or more concept instances may be generated such that MIDs include at least a path in the concept graph and one or more value keys that may correspond to one or more portions of the source data. In at least one of the various embodiments, the values from the source data that correspond to the MIDs may be indexed into indices that may be selected from a plurality of indices based on a content-type of the source data associated with the MIDs. In some embodiments, indexing the MIDs may include generating one or more index records that may include semantic equivalents of the value of one or more MIDs. Also, in other embodiments, the plurality of indices may include at least one index that is optimized for a content-type of text, at least one index that is optimized for a content-type of time, at least one index that is optimized for a content-type of geo-spatial information, or the like.

Further, in at least one of the various embodiments, in response to a query, a result set may be generated that includes result MIDs based on one or more indices of the plurality of indices such that content-types in the query may be employed to select the indices used to generate the result set.
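The selection of an index by content-type, both when indexing MID values and when answering a query, can be sketched as follows. This is an assumption-laden illustration: the class `MultiIndex`, the three content-type labels, the day-level time buckets, and the 0.1-degree geo grid are all hypothetical simplifications, not the indices shown in FIGS. 11 through 13.

```python
from collections import defaultdict
from datetime import datetime


class MultiIndex:
    """Hypothetical set of indices, one per content-type."""

    def __init__(self):
        self.indices = {
            "text": defaultdict(set),
            "time": defaultdict(set),
            "geo": defaultdict(set),
        }

    @staticmethod
    def _key(content_type, value):
        """Normalize a value into the lookup key used by its index."""
        if content_type == "text":
            return value.lower()
        if content_type == "time":
            return value.date().isoformat()        # bucket by calendar day
        if content_type == "geo":
            lat, lon = value
            return (round(lat, 1), round(lon, 1))  # coarse grid cell
        raise ValueError(f"no index for content-type {content_type!r}")

    def add(self, mid, content_type, value):
        """Index a MID under the index selected by its content-type."""
        key = self._key(content_type, value)
        self.indices[content_type][key].add(mid)

    def query(self, content_type, value):
        """Return result MIDs from the index selected by the query's content-type."""
        key = self._key(content_type, value)
        return set(self.indices[content_type].get(key, set()))


idx = MultiIndex()
idx.add("Movie[m1]/title", "text", "Example Film")
idx.add("Movie[m1]/released", "time", datetime(2015, 1, 21, 9, 30))
idx.add("Theater[t1]/location", "geo", (47.61, -122.33))

print(idx.query("time", datetime(2015, 1, 21, 23, 0)))
```

Because each index normalizes values in a way suited to its content-type, a time query matches any value in the same day bucket and a geo query matches any point in the same grid cell, while the result in every case is a set of MIDs.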

In at least one of the various embodiments, raw data graph elements may be generated based on the source data such that the value of the raw data graph elements may be absent from the source data. Also, one or more additional queries may be generated based on the result set of a previous query. And, in at least one of the various embodiments, the concept graph may be selected based on one or more ontologies.

Illustrative Operating Environment

FIG. 1 shows components of one embodiment of an environment in which embodiments of the invention may be practiced. Not all of the components may be required to practice the invention, and variations in the arrangement and type of the components may be made without departing from the spirit or scope of the invention. As shown, system 100 of FIG. 1 includes local area networks (LANs)/wide area networks (WANs)-(network) 110, wireless network 108, client computers 102-105, Semantic Modeling System Server Computer 116, or the like.

At least one embodiment of client computers 102-105 is described in more detail below in conjunction with FIG. 2. In one embodiment, at least some of client computers 102-105 may operate over one or more wired and/or wireless networks, such as networks 108 and/or 110. Generally, client computers 102-105 may include virtually any computer capable of communicating over a network to send and receive information, perform various online activities, offline actions, or the like. In one embodiment, one or more of client computers 102-105 may be configured to operate within a business or other entity to perform a variety of services for the business or other entity. For example, client computers 102-105 may be configured to operate as a web server, firewall, client application, media player, mobile telephone, game console, desktop computer, or the like. However, client computers 102-105 are not constrained to these services and may also be employed, for example, for end-user computing in other embodiments. It should be recognized that more or fewer client computers than shown in FIG. 1 may be included within a system such as described herein, and embodiments are therefore not constrained by the number or type of client computers employed.

Computers that may operate as client computer 102 may include computers that typically connect using a wired or wireless communications medium such as personal computers, multiprocessor systems, microprocessor-based or programmable electronic devices, network PCs, or the like. In some embodiments, client computers 102-105 may include virtually any portable computer capable of connecting to another computer and receiving information such as, laptop computer 103, mobile computer 104, tablet computers 105, or the like. However, portable computers are not so limited and may also include other portable computers such as cellular telephones, display pagers, radio frequency (RF) devices, infrared (IR) devices, Personal Digital Assistants (PDAs), handheld computers, wearable computers, integrated devices combining one or more of the preceding computers, or the like. As such, client computers 102-105 typically range widely in terms of capabilities and features. Moreover, client computers 102-105 may access various computing applications, including a browser, or other web-based application.

A web-enabled client computer may include a browser application that is configured to receive and to send web pages, web-based messages, and the like. The browser application may be configured to receive and display graphics, text, multimedia, and the like, employing virtually any web-based language, including wireless application protocol (WAP) messages, and the like. In one embodiment, the browser application is enabled to employ Handheld Device Markup Language (HDML), Wireless Markup Language (WML), WMLScript, JavaScript, Standard Generalized Markup Language (SGML), HyperText Markup Language (HTML), eXtensible Markup Language (XML), JavaScript Object Notation (JSON), or the like, to display and send a message. In one embodiment, a user of the client computer may employ the browser application to perform various activities over a network (online). However, another application may also be used to perform various online activities.

Client computers 102-105 also may include at least one other client application that is configured to receive content from and/or send content to another computer. The client application may include a capability to send and/or receive content, or the like. The client application may further provide information that identifies itself, including a type, capability, name, and the like. In one embodiment, client computers 102-105 may uniquely identify themselves through any of a variety of mechanisms, including an Internet Protocol (IP) address, a phone number, Mobile Identification Number (MIN), an electronic serial number (ESN), or other device identifier. Such information may be provided in a network packet, or the like, sent between other client computers, semantic modeling system server computer 116, source data server computer 118, or other computers.

Client computers 102-105 may further be configured to include a client application that enables an end-user to log into an end-user account that may be managed by another computer, such as semantic modeling system server computer 116, source data server computer 118, or the like. Such an end-user account, in one non-limiting example, may be configured to enable the end-user to manage one or more online activities, including in one non-limiting example, project management, software development, system administration, configuration management, search activities, social networking activities, browsing various websites, communicating with other users, or the like. Further, client computers may be arranged to enable users to provide raw data, configuration information, data curation information, queries, or the like, to semantic modeling system server computer 116. Also, client computers may be arranged to enable users to display reports and/or results provided by semantic modeling system server computer 116.

Wireless network 108 is configured to couple client computers 103-105 and their components with network 110. Wireless network 108 may include any of a variety of wireless sub-networks that may further overlay stand-alone ad-hoc networks, and the like, to provide an infrastructure-oriented connection for client computers 103-105. Such sub-networks may include mesh networks, Wireless LAN (WLAN) networks, cellular networks, and the like. In one embodiment, the system may include more than one wireless network.

Wireless network 108 may further include an autonomous system of terminals, gateways, routers, and the like connected by wireless radio links, and the like. These connectors may be configured to move freely and randomly and organize themselves arbitrarily, such that the topology of wireless network 108 may change rapidly.

Wireless network 108 may further employ a plurality of access technologies including 2nd (2G), 3rd (3G), 4th (4G), 5th (5G) generation radio access for cellular systems, WLAN, Wireless Router (WR) mesh, and the like. Access technologies such as 2G, 3G, 4G, 5G, and future access networks may enable wide area coverage for mobile computers, such as client computers 103-105 with various degrees of mobility. In one non-limiting example, wireless network 108 may enable a radio connection through a radio network access such as Global System for Mobile communication (GSM), General Packet Radio Services (GPRS), Enhanced Data GSM Environment (EDGE), code division multiple access (CDMA), time division multiple access (TDMA), Wideband Code Division Multiple Access (WCDMA), High Speed Downlink Packet Access (HSDPA), Long Term Evolution (LTE), and the like. In essence, wireless network 108 may include virtually any wireless communication mechanism by which information may travel between client computers 103-105 and another computer, network, a cloud-based network, a cloud instance, or the like.

Network 110 is configured to couple network computers with other computers, including, semantic modeling system server computer 116, source data server computer 118, client computers 102-105 through wireless network 108, or the like. Network 110 is enabled to employ any form of computer readable media for communicating information from one electronic device to another. Also, network 110 can include the Internet in addition to local area networks (LANs), wide area networks (WANs), direct connections, such as through a universal serial bus (USB) port, other forms of computer-readable media, or any combination thereof. On an interconnected set of LANs, including those based on differing architectures and protocols, a router acts as a link between LANs, enabling messages to be sent from one to another. In addition, communication links within LANs typically include twisted wire pair or coaxial cable, while communication links between networks may utilize analog telephone lines, full or fractional dedicated digital lines including T1, T2, T3, and T4, and/or other carrier mechanisms including, for example, E-carriers, Integrated Services Digital Networks (ISDNs), Digital Subscriber Lines (DSLs), wireless links including satellite links, or other communications links known to those skilled in the art. Moreover, communication links may further employ any of a variety of digital signaling technologies, including without limit, for example, DS-0, DS-1, DS-2, DS-3, DS-4, OC-3, OC-12, OC-48, or the like. Furthermore, remote computers and other related electronic devices could be remotely connected to either LANs or WANs via a modem and temporary telephone link. In one embodiment, network 110 may be configured to transport information of an Internet Protocol (IP).

Additionally, communication media typically embodies computer readable instructions, data structures, program modules, or other transport mechanism and includes any information non-transitory delivery media or transitory delivery media. By way of example, communication media includes wired media such as twisted pair, coaxial cable, fiber optics, wave guides, and other wired media and wireless media such as acoustic, RF, infrared, and other wireless media.

One embodiment of semantic modeling system server computer 116 is described in more detail below in conjunction with FIG. 3. Briefly, however, semantic modeling system server computer 116 includes virtually any network computer capable of generating and/or managing semantic models in a network environment.

Although FIG. 1 illustrates semantic modeling system server computer 116 and source data server computer 118 each as a single computer, the innovations and/or embodiments are not so limited. For example, one or more functions of semantic modeling system server computer 116, source data server computer 118, or the like, may be distributed across one or more distinct network computers. Moreover, semantic modeling system server computer 116 and source data server computer 118 are not limited to a particular configuration such as the one shown in FIG. 1. Thus, in one embodiment, semantic modeling system server computer 116 or source data server computer 118 may be implemented using a plurality of network computers. In other embodiments, server computers may be implemented using a plurality of network computers in a cluster architecture, a peer-to-peer architecture, or the like. Further, in at least one of the various embodiments, semantic modeling system server computer 116 or source data server computer 118 may be implemented using one or more cloud instances in one or more cloud networks. Accordingly, these innovations and embodiments are not to be construed as being limited to a single environment, and other configurations and architectures are also envisaged.

Illustrative Client Computer

FIG. 2 shows one embodiment of client computer 200 that may be included in a system in accordance with at least one of the various embodiments. Client computer 200 may include many more or fewer components than those shown in FIG. 2. However, the components shown are sufficient to disclose an illustrative embodiment for practicing the present invention. Client computer 200 may represent, for example, one embodiment of at least one of client computers 102-105 of FIG. 1.

As shown in the figure, client computer 200 includes a processor device, such as processor 202 in communication with a mass memory 226 via a bus 234. In some embodiments, processor 202 may include one or more central processing units (CPU) and/or one or more processing cores.

Client computer 200 also includes a power supply 228, one or more network interfaces 236, an audio interface 238, a display 240, a keypad 242, an illuminator 244, a video interface 246, an input/output interface 248, a haptic interface 250, and a global positioning system (GPS) receiver 232.

Power supply 228 provides power to client computer 200. A rechargeable or non-rechargeable battery may be used to provide power. The power may also be provided by an external power source, such as an alternating current (AC) adapter or a powered docking cradle that supplements and/or recharges a battery.

Client computer 200 may optionally communicate with a base station (not shown), or directly with another computer. Network interface 236 includes circuitry for coupling client computer 200 to one or more networks, and is constructed for use with one or more communication protocols and technologies including, but not limited to, GSM, CDMA, TDMA, GPRS, EDGE, WCDMA, HSDPA, LTE, user datagram protocol (UDP), transmission control protocol/Internet protocol (TCP/IP), short message service (SMS), WAP, ultra wide band (UWB), IEEE 802.16 Worldwide Interoperability for Microwave Access (WiMax), session initiation protocol/real-time transport protocol (SIP/RTP), or any of a variety of other wireless communication protocols. Network interface 236 is sometimes known as a transceiver, transceiving device, or network interface card (NIC).

Audio interface 238 is arranged to produce and receive audio signals such as the sound of a human voice. For example, audio interface 238 may be coupled to a speaker and microphone (not shown) to enable telecommunication with others and/or generate an audio acknowledgement for some action.

Display 240 may be a liquid crystal display (LCD), gas plasma, light emitting diode (LED), organic LED, or any other type of display used with a computer. Display 240 may also include a touch sensitive screen arranged to receive input from an object such as a stylus or a digit from a human hand.

Keypad 242 may comprise any input device arranged to receive input from a user. For example, keypad 242 may include a push button numeric dial, or a keyboard. Keypad 242 may also include command buttons that are associated with selecting and sending images.

Illuminator 244 may provide a status indication and/or provide light. Illuminator 244 may remain active for specific periods of time or in response to events. For example, when illuminator 244 is active, it may backlight the buttons on keypad 242 and stay on while the client computer is powered. Also, illuminator 244 may backlight these buttons in various patterns when particular actions are performed, such as dialing another client computer. Illuminator 244 may also cause light sources positioned within a transparent or translucent case of the client computer to illuminate in response to actions.

Video interface 246 is arranged to capture video images, such as a still photo, a video segment, an infrared video, or the like. For example, video interface 246 may be coupled to a digital video camera, a web-camera, or the like. Video interface 246 may comprise a lens, an image sensor, and other electronics. Image sensors may include a complementary metal-oxide-semiconductor (CMOS) integrated circuit, charge-coupled device (CCD), or any other integrated circuit for sensing light.

Client computer 200 also comprises input/output interface 248 for communicating with external devices, such as a headset, or other input or output devices not shown in FIG. 2. Input/output interface 248 can utilize one or more communication technologies, such as USB, infrared, Bluetooth™, or the like.

Haptic interface 250 is arranged to provide tactile feedback to a user of the client computer. For example, the haptic interface 250 may be employed to vibrate client computer 200 in a particular way when another user of a computer is calling. In some embodiments, haptic interface 250 may be optional.

Client computer 200 may also include GPS transceiver 232 to determine the physical coordinates of client computer 200 on the surface of the Earth. GPS transceiver 232, in some embodiments, may be optional. GPS transceiver 232 typically outputs a location as latitude and longitude values. However, GPS transceiver 232 can also employ other geo-positioning mechanisms, including, but not limited to, triangulation, assisted GPS (AGPS), Enhanced Observed Time Difference (E-OTD), Cell Identifier (CI), Service Area Identifier (SAI), Enhanced Timing Advance (ETA), Base Station Subsystem (BSS), or the like, to further determine the physical location of client computer 200 on the surface of the Earth. It is understood that under different conditions, GPS transceiver 232 can determine a physical location within millimeters for client computer 200; and in other cases, the determined physical location may be less precise, such as within a meter or significantly greater distances. In one embodiment, however, client computer 200 may, through other components, provide other information that may be employed to determine a physical location of the computer, including for example, a Media Access Control (MAC) address, IP address, or the like.

Mass memory 226 includes a Random Access Memory (RAM) 204, a Read-only Memory (ROM) 222, and other storage means. Mass memory 226 illustrates an example of computer readable storage media (devices) for storage of information such as computer readable instructions, data structures, program modules or other data. Mass memory 226 stores a basic input/output system (BIOS) 224, or the like, for controlling low-level operation of client computer 200. The mass memory also stores an operating system 206 for controlling the operation of client computer 200. It will be appreciated that this component may include a general-purpose operating system such as a version of UNIX, or LINUX™, or a specialized client communication operating system such as Microsoft Corporation's Windows Mobile™, Apple Corporation's iOS™, Google Corporation's Android™, or the like. The operating system may include, or interface with a Java virtual machine module that enables control of hardware components and/or operating system operations via Java application programs.

Mass memory 226 further includes one or more data storage 208, which can be utilized by client computer 200 to store, among other things, applications 214 and/or other data. For example, data storage 208 may also be employed to store information that describes various capabilities of client computer 200. The information may then be provided to another computer based on any of a variety of events, including being sent as part of a header during a communication, sent upon request, or the like. Data storage 208 may also be employed to store social networking information including address books, buddy lists, aliases, user profile information, user credentials, or the like. Further, data storage 208 may also store messages, web page content, or any of a variety of user generated content.

At least a portion of the information stored in data storage 208 may also be stored on another component of client computer 200, including, but not limited to processor readable storage media 230, a disk drive or other computer readable storage devices (not shown) within client computer 200.

Processor readable storage media 230 may include volatile, non-transitive, non-transitory, nonvolatile, removable, and non-removable media implemented in any method or technology for storage of information, such as computer- or processor-readable instructions, data structures, program modules, or other data. Examples of computer readable storage media include RAM, ROM, Electrically Erasable Programmable Read-only Memory (EEPROM), flash memory or other memory technology, Compact Disc Read-only Memory (CD-ROM), digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other physical medium which can be used to store the desired information and which can be accessed by a computer. Processor readable storage media 230 may also be referred to herein as computer readable storage media and/or computer readable storage device.

Applications 214 may include computer executable instructions which, when executed by client computer 200, transmit, receive, and/or otherwise process network data. Network data may include, but is not limited to, messages (e.g. SMS, Multimedia Message Service (MMS), instant message (IM), email, and/or other messages), audio, and video, and may enable telecommunication with another user of another computer. Applications 214 may include, for example, a browser 218, and other applications 220.

Browser 218 may include virtually any application configured to receive and display graphics, text, multimedia, messages, and the like, employing virtually any web based language. In one embodiment, the browser application is enabled to employ HDML, WML, WMLScript, JavaScript, SGML, HTML, XML, and the like, to display and send a message. However, any of a variety of other web-based programming languages may be employed. In one embodiment, browser 218 may enable a user of client computer 200 to communicate with another network computer, such as semantic modeling system server computer 116, source data server computer 118, or the like, as shown in FIG. 1.

Other applications 220 may include, but are not limited to, calendars, search programs, email clients, IM applications, SMS applications, voice over Internet Protocol (VOIP) applications, contact managers, task managers, transcoders, database programs, word processing programs, software development tools, security applications, spreadsheet programs, games, and so forth.

Illustrative Network Computer

FIG. 3 shows one embodiment of a network computer 300, according to one embodiment of the invention. Network computer 300 may include many more or fewer components than those shown. The components shown, however, are sufficient to disclose an illustrative embodiment for practicing the invention. Network computer 300 may be configured to operate as a server, client, peer, a host, cloud instance, or any other computer. Network computer 300 may represent, for example, semantic modeling system server computer 116, and/or other network computers, such as, source data server computer 118.

Network computer 300 includes one or more processor devices, such as, processor 302. Also, network computer 300 includes processor readable storage media 328, network interface unit 330, an input/output interface 332, hard disk drive 334, video display adapter 336, and memory 326, all in communication with each other via bus 338.

As illustrated in FIG. 3, network computer 300 also can communicate with the Internet, or other communication networks, via network interface unit 330, which is constructed for use with various communication protocols including the TCP/IP protocol. Network interface unit 330 is sometimes known as a transceiver, transceiving device, or network interface card (NIC).

Network computer 300 also comprises input/output interface 332 for communicating with external devices, such as a keyboard, or other input or output devices not shown in FIG. 3. Input/output interface 332 can utilize one or more communication technologies, such as USB, infrared, NFC, Bluetooth™, or the like.

Memory 326 generally includes RAM 304, ROM 322 and one or more permanent mass storage devices, such as hard disk drive 334, tape drive, optical drive, and/or floppy disk drive. Memory 326 stores operating system 306 for controlling the operation of network computer 300. Any general-purpose operating system may be employed. Basic input/output system (BIOS) 324 is also provided for controlling the low-level operation of network computer 300.

Although illustrated separately, memory 326 may include processor readable storage media 328. Processor readable storage media 328 may be referred to and/or include computer readable media, computer readable storage media, and/or processor readable storage device. Processor readable storage media 328 may include volatile, nonvolatile, non-transitory, non-transitive, removable, and non-removable media implemented in any method or technology for storage of information, such as computer readable instructions, data structures, program modules, or other data. Examples of processor readable storage media include RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other media which can be used to store the desired information and which can be accessed by a computer.

Memory 326 further includes one or more data storage 308, which can be utilized by network computer 300 to store, among other things, applications 314 and/or other data. For example, data storage 308 may also be employed to store information that describes various capabilities of network computer 300. The information may then be provided to another computer based on any of a variety of events, including being sent as part of a header during a communication, sent upon request, or the like. Data storage 308 may also be employed to store messages, web page content, or the like. At least a portion of the information may also be stored on another component of network computer 300, including, but not limited to processor readable storage media 328, hard disk drive 334, or other computer readable storage media (not shown) within network computer 300.

Data storage 308 may include a database, text, spreadsheet, folder, file, or the like, that may be configured to maintain and store user account identifiers, user profiles, email addresses, IM addresses, and/or other network addresses, or the like. Data storage 308 may further include program code, data, algorithms, and the like, for use by a processor device, such as processor 302, to execute and perform actions. In one embodiment, at least some of data storage 308 might also be stored on another component of network computer 300, including, but not limited to processor-readable storage media 328, hard disk drive 334, or the like.

Further, in at least one of the various embodiments, a network computer arranged as a source data computer, such as, source data server computer 118, may include one or more hard drives, optical drives, solid state storage drives or the like, for storing the raw and/or source data that may be processed by semantic modeling system server computer 116.

Data storage 308 may include multiple content indices 310. In at least one of the various embodiments, content indices 310 may include information for various content indices such as n-gram indices, temporal indices, geospatial indices, or the like. Also, in at least one of the various embodiments, data storage 308 may include model identity (MID) indices 311 for storing join indices, inverted MID indices, and other helper indices. Further, in at least one of the various embodiments, data storage 308 may include model graphs 312 for representing the organization and/or structure of concepts and/or information that may be modeled.

Applications 314 may include computer executable instructions, which may be loaded into mass memory and run on operating system 306. Examples of application programs may include transcoders, schedulers, calendars, database programs, word processing programs, Hypertext Transfer Protocol (HTTP) programs, customizable user interface programs, IPSec applications, encryption programs, security programs, SMS message servers, IM message servers, email servers, account managers, and so forth. Applications 314 may also include web server 316, ingestion engine 318, indexer application 319, mapping engine 320, knowledge manager application 321, or the like.

Web server 316 may represent any of a variety of information and services that are configured to provide content, including messages, over a network to another computer. Thus, web server 316 can include, for example, a web server, a File Transfer Protocol (FTP) server, a database server, a content server, email server, or the like. Web server 316 may provide the content including messages over the network using any of a variety of formats including, but not limited to WAP, HDML, WML, SGML, HTML, XML, Compact HTML (cHTML), Extensible HTML (xHTML), or the like.

Illustrative Logical Architecture

FIG. 4 shows a logical schematic of a portion of semantic modeling system 400 in accordance with at least one of the various embodiments. Briefly, in at least one of the various embodiments, system 400 may include ingestion manager 402, mapping manager 404, knowledge manager 406, multiple indices 408, model graph 410, and model graph 412. In at least one of the various embodiments, other/additional parts of system 400 not shown in FIG. 4 may include raw data, raw data graphs, MID indices, join indices, inverted/reverse indices, or the like.

In at least one of the various embodiments, ingestion manager 402 may be arranged to perform actions to process source data as it is added to the system. Data may be provided from various sources, including, files stored on local or remote file systems, streaming data, one or more source data computers, such as, source data server computer 118, or the like.

In at least one of the various embodiments, ingestion manager 402 may be arranged to process source data to produce one or more raw data graphs based on the inherent structure of the raw data.

In at least one of the various embodiments, mapping manager 404 may be arranged to generate and/or facilitate the generation of concept graphs such as, concept graph 410 and/or concept graph 412. In at least one of the various embodiments, mapping manager 404 may be arranged to map raw data nodes and/or data fields produced by ingestion manager 402 to concept nodes in one or more concept models. Further, in at least one of the various embodiments, mapping manager 404 may be arranged to generate one or more MIDs that may be indexed in indices, such as, indices 408.

In at least one of the various embodiments, there may be multiple indices that may be employed for indexing MIDs. The particular index that may be selected for indexing a MID may depend on the content type of the source data. For example: content for MIDs representing text/character values may be indexed using n-gram indices; MIDs representing time-based values may be indexed in temporal indices; or MIDs representing geographical/geo-spatial values may be indexed in geo-spatial indices. Accordingly, in at least one of the various embodiments, different types of data may be indexed using indices that may be optimized for the content-types of the values associated with the MID.
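By way of a non-limiting illustration, the selection of an index by content type may be sketched as follows. The function names, the index names, and the use of a latitude/longitude pair to stand for a geo-spatial value are assumptions made for this sketch only; the indices are modeled as plain lists rather than real n-gram, temporal, or geo-spatial index structures.

```python
from datetime import datetime

def select_index(value):
    """Return the (hypothetical) index name suited to the value's content type."""
    if isinstance(value, datetime):
        return "temporal"
    # Assumption: a geo-spatial value is a (latitude, longitude) pair of numbers.
    if (isinstance(value, tuple) and len(value) == 2
            and all(isinstance(v, (int, float)) for v in value)):
        return "geospatial"
    if isinstance(value, str):
        return "ngram"
    raise TypeError("no index registered for %r" % type(value))

def index_mid(indices, mid, value):
    """Record the MID and its value in the index chosen for the value's content type."""
    name = select_index(value)
    indices.setdefault(name, []).append((mid, value))
    return name

indices = {}
index_mid(indices, "mid-1", "Casablanca")            # text value
index_mid(indices, "mid-2", datetime(1942, 11, 26))  # time-based value
index_mid(indices, "mid-3", (33.57, -7.59))          # geo-spatial value
```

In this sketch each MID is routed to exactly one index; an actual embodiment may index a MID in more than one index.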

In at least one of the various embodiments, various actions such as responding to queries or data mining may be performed by knowledge manager 406. In at least one of the various embodiments, knowledge manager 406 may be arranged to generate result sets in response to queries or other commands provided by users or remote applications. Further, in at least one of the various embodiments, users may be enabled to provide queries and other commands using a graphical user-interface and/or web page.

FIG. 5A shows a logical schematic of a portion of ingestion engine 500 in accordance with at least one of the various embodiments. As discussed briefly above, ingestion engines, such as, ingestion engine 500 may be arranged to process source data records to generate raw data graphs that represent the content structure of the source data.

In at least one of the various embodiments, ingestion engine 500 may be provided source data represented by data record 502. In at least one of the various embodiments, an ingestion engine may ingest data records from a variety of sources. Further, the data records may be provided in different formats, such as, XML, HTML, office application documents, database export files, database result sets, log files, unstructured data, CSV files, data streams, image files, video files, video streams, or the like.

In at least one of the various embodiments, as each source data record enters ingestion engine 500 it may be provided to an ingestion point, such as, ingestion point 504. In at least one of the various embodiments, ingestion point 504 represents the logical entry point for source data to enter the system. In at least one of the various embodiments, ingestion point 504 may be arranged to perform actions that include generating a payload object that may be a logical envelope for data record 502 as it is processed by ingestion engine 500.

In at least one of the various embodiments, an ingestion engine may generate a raw data graph based on the structure of the source data. In at least one of the various embodiments, if the source data is provided using an XML file, the ingestion engine may generate a raw data graph based on the structure embedded in the XML file. Also, in at least one of the various embodiments, if the source data is a database export file, the shape of the raw data graph may be generated from the database schema that is associated with the database export file.
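As a minimal sketch of the XML case described above, a raw data graph may be derived by walking the element tree so that each raw data node mirrors one XML element. The dictionary layout of a node (`name`, `value`, `children`) and the sample record are assumptions chosen for illustration and are not prescribed by the embodiments.

```python
import xml.etree.ElementTree as ET

def raw_graph_from_xml(xml_text):
    """Build a nested-dict raw data graph whose shape follows the XML structure."""
    def build(elem):
        return {
            "name": elem.tag,                      # node name taken from the element tag
            "value": (elem.text or "").strip(),    # field value, empty for container nodes
            "children": [build(child) for child in elem],
        }
    return build(ET.fromstring(xml_text))

record = (
    "<movie><title>Casablanca</title>"
    "<actor><first>Humphrey</first><last>Bogart</last></actor></movie>"
)
graph = raw_graph_from_xml(record)
```

Here the graph has a `movie` root with `title` and `actor` children, and `actor` in turn has `first` and `last` children, matching the nesting of the source record.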

In at least one of the various embodiments, payload 506 may be comprised of XML, JSON, or other structured data formats, including data structures such as, lists, hashes, objects, or the like. Initially, in at least one of the various embodiments, payload 506 may include minimal information, such as, an identifier, a time-stamp, the source data record, a reference to the source data record, or the like.

In at least one of the various embodiments, payload 506 may be provided to a classification pipeline, such as, classification pipeline 508. In at least one of the various embodiments, classification pipeline 508 may include one or more classifiers, such as, classifier 510, classifier 512, classifier 514, and so on. In at least one of the various embodiments, pipeline 508 may include the one or more classifiers that are registered for the pending ingestion process. For example, in at least one of the various embodiments, pipeline 508 may be arranged to select the one or more classifiers from a registration list, registration database, or other configuration information.

Further, in at least one of the various embodiments, pipeline 508 may be arranged to provide payload 506 to each registered classifier in turn. In at least one of the various embodiments, the particular order in which the classifiers operate on payload 506 may be determined based on a rank order associated with each classifier. For example, in at least one of the various embodiments, the order that the classifiers are listed in a registration list may correspond to the order that they are enabled to operate on payload 506. In other embodiments, classifiers may be assigned a rank, or priority value by a user or in configuration information. In at least one of the various embodiments, in some cases one or more classifiers may be defined as eligible for running in parallel with each other.

In at least one of the various embodiments, classifiers may be arranged to receive payload 506 and perform one or more actions for classifying source data record 502. In at least one of the various embodiments, classifiers may comprise one or more scripts, policies, rules, functions or processes for analyzing and/or classifying the information included in the payload.

Also, in at least one of the various embodiments, a classifier may modify the payload by adding some or all of the results (if any) generated or determined during its turn at processing the payload. Accordingly, in at least one of the various embodiments, subsequently executed classifiers may be arranged to recognize, process, and/or react to modifications to the payload that may be made by one or more of the upstream classifiers.
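The payload and pipeline behavior described above may be sketched, under stated assumptions, as follows. The payload is modeled as a dictionary, the registration list as (rank, classifier) pairs where a lower rank runs first, and the two classifiers are hypothetical examples; none of these names come from the embodiments themselves.

```python
import time

def make_payload(record, record_id):
    """Create the initial envelope: identifier, time-stamp, and the source record."""
    return {"id": record_id, "timestamp": time.time(),
            "record": record, "annotations": []}

def run_pipeline(payload, registry):
    """Provide the payload to each registered classifier in rank order."""
    for _rank, classifier in sorted(registry, key=lambda entry: entry[0]):
        classifier(payload)   # each classifier may add its results to the payload
    return payload

def tag_csv(payload):
    # Hypothetical classifier: annotate comma-separated records.
    if "," in payload["record"]:
        payload["annotations"].append("format:csv")

def count_fields(payload):
    # Hypothetical downstream classifier that reacts to the upstream annotation.
    if "format:csv" in payload["annotations"]:
        n = len(payload["record"].split(","))
        payload["annotations"].append("fields:%d" % n)

registry = [(2, count_fields), (1, tag_csv)]   # listed out of order on purpose
result = run_pipeline(make_payload("Rick,Blaine,1942", record_id=1), registry)
```

Because `tag_csv` has the lower rank it runs first, so `count_fields` can see and build on the annotation it left in the payload. Parallel execution of eligible classifiers is not shown.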

In at least one of the various embodiments, an individual classifier may be comprised of processor readable instructions and/or configuration information that may be arranged to recognize and extract content for particular types of data records.

In at least one of the various embodiments, classifiers may be arranged to examine the content of the data record to determine one or more entities and/or resources that may be embedded or included in the data record. In at least one of the various embodiments, classifiers may include heuristic tests that may be made up of one or more of pattern matches, content matches, or the like. For example, in at least one of the various embodiments, a classifier, such as, classifier 512, may be configured to identify first name and last name information from string content having a particular format. In this example, classifier 512 may include one or more pattern matching expressions (e.g., regular expressions) for identifying information in the incoming record and/or payload that may correspond to a person's first name and last name. In at least one of the various embodiments, one or more well-known pattern matching and/or data extraction techniques may be employed with the particular patterns and extractors adapted to the formatting and content of the source record.

Accordingly, in some embodiments, the configuration of a classifier may be adapted to one or more characteristics of the data record, such as, a type of data record (e.g., patient record, web-server log file, finance transaction logs, and so on), a format of the data records (e.g., XML, CSV, JSON, HTML, or the like), source of the data records, or the like.

In at least one of the various embodiments, payload 516 represents the payload after each classifier in pipeline 508 has had an opportunity to examine and process the data record and add its result information, if any. In this example, payload 516 may include the results produced by classifier 510, classifier 512, and classifier 514. In some embodiments, such results may include annotations that may be included in a raw data graph. For example, classifiers that are arranged for identifying dates, person names, telephone numbers, email addresses, physical addresses, or the like, may annotate the fields of raw data graph nodes accordingly. In some embodiments, classifiers may include a confidence score that corresponds with their annotation. For example, a classifier that is arranged for identifying fields that represent email addresses may include a confidence score (e.g., 30%, 80%, or the like) that indicates how well the raw data field matched the conditions of the classifier.
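One way such an annotating classifier could be sketched is shown below. The regular expression is a deliberately loose pattern, and defining the confidence score as the fraction of non-empty field values that match is an assumption of this sketch; the embodiments state only that the score indicates how well the field matched the classifier's conditions.

```python
import re

# Loose illustrative pattern: something@something.something, no whitespace.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def classify_email_field(values):
    """Return an email annotation with a confidence score, or None if nothing matches."""
    non_empty = [v.strip() for v in values if v.strip()]
    if not non_empty:
        return None
    hits = sum(1 for v in non_empty if EMAIL_RE.match(v))
    if hits == 0:
        return None
    # Assumed scoring rule: share of sampled values that look like email addresses.
    return {"type": "email", "confidence": hits / len(non_empty)}

sample = ["a@example.com", "b@example.org", "n/a", "c@example.net", "d@example.com"]
annotation = classify_email_field(sample)
```

With four of the five sampled values matching, the annotation carries a confidence of 0.8, which a downstream mapping step could compare against a threshold.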

In at least one of the various embodiments, sending data record 502 through pipeline 508 may produce information corresponding to a raw data graph portion, such as raw data graph portion 518 that represents the raw data nodes and fields that were determined by the ingestion engine and/or classifiers.

In at least one of the various embodiments, data record 520 illustrates that the source data may be preserved in an unmodified state. In at least one of the various embodiments, the source data record may be stored in its original state. This at least enables the same data to be reprocessed at a later date. Also, it enables users to review/access the original source data record as needed.

FIG. 5B shows a logical schematic of a portion of ingestion engine 500 in accordance with at least one of the various embodiments. In at least one of the various embodiments, raw data and/or source data such as data 522 may be ingested at an ingestion point, such as, ingestion point 524. In at least one of the various embodiments, ingestion point 524 may be provided the raw data (as described above) and process it to produce an initial raw data graph, such as, raw data graph 526. In at least one of the various embodiments, the structure of raw data graph 526 may initially be determined based on the structure of the source data that is ingested.

In at least one of the various embodiments, if the source data is successfully ingested, it may next be classified. In at least one of the various embodiments, a classification engine, such as, classification engine 528 may be arranged to execute one or more classifiers, such as, classifiers 530.

In at least one of the various embodiments, the one or more classifiers may be registered with the ingestion engine. However, in some embodiments, they may be arranged to execute after the source data has been initially ingested. In at least one of the various embodiments, classifiers 530 may be arranged to perform actions similar to those described for FIG. 5A. However, in some embodiments, classifiers 530 may be arranged to execute directly on the raw data graph rather than being executed using a pipeline with payload architecture.

Accordingly, in at least one of the various embodiments, if classification engine 528 finishes its initial run, raw data graph 532 may be generated. Raw data graph 532 may represent raw data graph 526 as modified by classifiers 530.

FIG. 6 illustrates a logical representation of a portion of semantic modeling system 600 in accordance with at least one of the various embodiments. As discussed above, the ingestion engine may produce a raw data graph from the source data for mapping to one or more concept models. Accordingly, one or more concept model graphs may be employed to represent the structure of the concept model.

In at least one of the various embodiments, concept graph 602 and concept graph 604 represent the structure of the concept models that are mapped to fields and nodes of a raw data graph that was produced during the ingestion process. In at least one of the various embodiments, the structure of relationships of the concepts may be logically represented as a graph of nodes and edges. Nodes may represent concepts and the edges may represent relationships between concepts. In at least one of the various embodiments, some concepts may be represented by separate concept model graphs each having separate root nodes, such as, root node 606 and root node 614. The particular shape of a concept model graph may be determined by an ontology that may define the concepts and their relationships. Nodes and fields from a raw data graph may be mapped to concepts and/or properties in the concept models. In at least one of the various embodiments, multiple concept model graphs having different arrangements (shapes) may be generated from the same raw data graph and/or source data depending on how the raw data may be mapped to the concept model.

For example, if data records from a movie database were the source data, concept graph 602 may be arranged such that node 608 represents movies, node 610 represents the release date of a movie, and node 612 represents actors that may be in a movie. Accordingly, fields from a raw data model may be mapped to the concepts in concept graph 602. Additional concepts/entities not shown here associated with movies may flesh out the concept graph, such as, production company, locations, national origin, language, producers, directors, or the like.

As mentioned, in at least one of the various embodiments, concept graph 602 represents just one particular shape in which ingested movie database information may be modeled. For example, concept graph 604 may be arranged to represent people. Accordingly, node 616 may represent persons, with other nodes representing features of the persons, such as, node 618 may represent the first name of a person, and node 620 may represent the last name of a person, and so on. Thus, as shown in this example, one or more different concept graphs having different structural shapes may be based on and/or mapped to the same ingested source data depending on the configuration of the ontologies of the concept models and/or the mapping processes employed.

FIG. 7 illustrates a logical representation of a portion of semantic model 700 showing a referential relationship between two portions of the model in accordance with at least one of the various embodiments. In at least one of the various embodiments, a mapping engine may be arranged to generate multiple concept model graphs from the same source data. For example, if a movie database is being ingested, information in the raw data may be mapped to concepts related to the movies in the database. Accordingly, from root 706, the mapping engine may map information from the raw data to a movie concept represented by movie node 708, and other concepts (properties of a movie concept) shown as child nodes, such as, movie title (node 710), movie release date (node 712), and actor (node 714). Note, in some embodiments, additional nodes not shown here may be included to represent other concepts, such as, producer, director, assistant director, and so on. They are omitted here for brevity and clarity. However, one of ordinary skill in the art will appreciate that additional concepts may be used in a semantic model for a movie database.

In at least one of the various embodiments, a concept in one model may be arranged to reference a concept that may be part of another semantic model. Accordingly, for example, in addition to generating a concept model for movies, the system may be arranged to generate a concept model that represents persons in general. Naturally, actors from the movie information would qualify as persons and may be represented in the person model as well as in the movie model. In the example shown in FIG. 7, concept graph 704 includes root node 716 and person concept node 718. Further, since the actor concepts in concept graph 702 are also persons, edge 720 represents that the values for a person concept (person node 718) may come from values for an actor concept (node 714). Also, in at least one of the various embodiments, there may be more than one reference relationship for defining the values for person concepts (node 718). For example, if the movie model (represented by model graph 702) included a director concept node (not shown), there may be an additional reference from person concept node 718 to the director concept node (not shown) in model graph 702.

Thus, in at least one of the various embodiments, a mapping process may identify or create relationships for concepts in one concept model to concepts in other concept models. In this example, different concepts in different models having properties, such as, first name, last name, or the like, may be modeled using the person concept model as well as in the movie concept model.

FIG. 8 illustrates a logical representation of a portion of semantic model 800 showing referential relationships in accordance with at least one of the various embodiments. In at least one of the various embodiments, in addition to generating relationships between concepts in different concept models, the mapping engine may include processes that shape information from referenced concepts to suit the needs of another concept model that references them. For example, in at least one of the various embodiments, concept model 802 comprises root node 806, movies concept 808, actor concept 810, credits ranking concept 812, first name concept 814, and last name concept 816, whereas model 804, which represents a person model, comprises root node 818, person concept 820, first name concept 822, and last name concept 824. As discussed above, a mapping process may be arranged to reference actor concepts from model 804, since actors are persons. However, in this example, in at least one of the various embodiments, since concept model 804 represents persons in general, it may not be appropriate for all of the information associated with an actor concept to be included in the person concept in concept model 804. For example, actor concept 810 may be arranged to include credit rank concept 812 for representing the rank order placement of an actor in a movie's credits. Actors that are the most popular or most important for a movie may have a higher rank than a less popular actor's rank. In at least one of the various embodiments, even though this type of information may be relevant for actors it may be unlikely to be relevant for persons in general. Accordingly, in this example, person concept 820 may be arranged to reference actor concept 810, but person concept 820 omits the concepts/properties, such as, credit rank concept 812 that is part of actor concept 810.
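The shaping of referenced information described for FIG. 8 may be illustrated with a short sketch in which a person instance is derived from an actor instance while the actor-specific credit rank is omitted. The property names, the sample instance, and the `project_reference` helper are hypothetical and are used only to make the omission concrete.

```python
def project_reference(instance, kept_properties):
    """Copy only the properties that the referencing concept model declares."""
    return {name: value for name, value in instance.items()
            if name in kept_properties}

# Hypothetical actor instance from the movie model, including its credit rank.
actor = {"first_name": "Ingrid", "last_name": "Bergman", "credit_rank": 2}

# The person model is assumed to declare only first name and last name.
PERSON_PROPERTIES = ("first_name", "last_name")

person = project_reference(actor, PERSON_PROPERTIES)
```

The resulting person instance carries the first and last name but not the credit rank, while the actor instance itself is left unchanged in the movie model.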

FIG. 9 illustrates a logical representation of a portion of the indexing process for a semantic modeling system 900 showing a portion of model-identifiers (MIDs) that may be generated for a concept model. In at least one of the various embodiments, semantic modeling system 900 may be arranged for generating semantic models that use multiple indices. Accordingly, FIG. 9 shows three major components of system 900, including, concept model 902, indexer 904 and indices 906.

In at least one of the various embodiments, concept model 902 includes the concepts defined by an ontology and mapped to raw data by a mapping engine, such as, mapping engine 320. For this example, and to help provide clarity in this description, concept model 902 may be an example of a portion of a concept model based on an ontology for movies. As such, concept model 902 may comprise movie concept 908, movie title concept 910, movie release date concept 912, actor concept 914, actor first name concept 916, actor last name concept 918, actor rank concept 920, or the like.

In at least one of the various embodiments, indexer 904 may employ model paths, such as model path 908, to represent structural information that corresponds to the logical representation of the concept in model graph 902. Accordingly, as further discussed below, the path information may be included in the MID that corresponds to an instance of the concept.

Accordingly, in this example, in at least one of the various embodiments, movie title concept 910 has a corresponding model path of ‘/MovieDB/Movie/Title’. The path itself does not define a particular instance of a concept; rather, it describes where concepts of this type exist in the concept model. Thus, model paths are structural in nature, defining the shape of the information rather than the particular value for a concept. Likewise, in at least one of the various embodiments, movie actor concept 914 has a model path of ‘/MovieDB/Movie/Actor’. And, movie actor first name concept 916 has a path of ‘/MovieDB/Movie/Actor/First Name’.

In at least one of the various embodiments, indices 906 may be where the values that correspond to particular concept instances may be indexed. In at least one of the various embodiments, indices 906 include multiple indices because values for the concepts discovered during ingestion may be indexed using indices that may be optimized for the data type of the value. In at least one of the various embodiments, indexer 904 may employ meta-data, such as content-type information that may be included in the raw data graph nodes that are mapped to the concept, to select indices for indexing a concept.

Referring to concept model 902 as an example, values associated with movie title concept 910 may be indexed in an n-gram index because the value for a title is text information suitable for indexing with an n-gram index. Likewise, in at least one of the various embodiments, values associated with movie release date concept 912 may be indexed in a temporal index since the value is a time value. And, in at least one of the various embodiments, values associated with concepts that represent geographic information, such as a movie's country of origin or an actor's birthplace, may be indexed using a geo-spatial index since their values are geospatial. Accordingly, the values associated with the concepts may be indexed (and searched for) using indexes that are optimized for the type of data comprising their underlying values.
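The selection of an index by content-type can be illustrated with a minimal sketch. The content-type labels ("text", "datetime", "geo"), the index names, the fallback to an n-gram index, and the 'ReleaseDate' and 'Premiere' path segments are all assumptions made for illustration; the description does not enumerate them.

```python
# Hypothetical content-type labels mapped to hypothetical index names.
INDEX_FOR_CONTENT_TYPE = {
    "text": "ngram",
    "datetime": "temporal",
    "geo": "geospatial",
}


def select_index(content_type):
    """Pick an index name for a content-type.

    Falling back to the n-gram index for unknown types is an assumption.
    """
    return INDEX_FOR_CONTENT_TYPE.get(content_type, "ngram")


def build_indices(records):
    """Group (mid, value, content_type) records into per-type indices."""
    indices = {}
    for mid, value, content_type in records:
        indices.setdefault(select_index(content_type), []).append((value, mid))
    return indices


# Illustrative records; the MID serialization "path|keys" is hypothetical.
records = [
    ("/MovieDB/Movie/Title|0:10", "example title", "text"),
    ("/MovieDB/Movie/ReleaseDate|0:10", "2010-09-18", "datetime"),
    ("/MovieDB/Movie/Premiere|0:10", (-33.87, 151.21), "geo"),
]
indices = build_indices(records)
```

Each value lands in exactly one index here; an embodiment could equally index one value in several indices.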

FIG. 10 illustrates model-identifiers (MIDs) in accordance with at least one of the various embodiments. In at least one of the various embodiments, model-identifiers (MIDs) may be employed for identifying a particular instance of a concept in a concept model. For at least one of the various embodiments, layout 1000 represents a layout for MIDs that may be arranged to include various fields. In at least one of the various embodiments, MIDs may include field 1002 that may represent the model path for a concept within a concept model graph. Accordingly, the length of field 1002 may vary depending on the concept model graph and a concept's location in the graph. In most embodiments, the model path may start with a root node followed by the various concept nodes that would be visited during a traversal of a model graph to the concept.

In at least one of the various embodiments, field 1004 may hold the keys, if any, that correspond to the individual instances of concepts included in the path. In some embodiments, the keys may be necessary to identify the particular instance(s) of a concept that is in the path. For example, some concepts represented in a model graph may represent more than one particular instance of a concept. This is possible and/or likely because the model graph represents the structure of the information rather than pointing to particular instances of data. Accordingly, in some embodiments, if a MID path includes concepts that correspond to multiple concept instances, a key is provided for each multivalued concept in the model path to determine the particular instance of the concept that is represented by the MID. Further, for fields that have singular representations a key value of zero may be supplied.

For example, in at least one of the various embodiments, MID 1008 illustrates a particular instance of a concept. In this example, MID 1008 represents the concept of a first name for an actor in a movie. In this example, field 1010 includes the path within the concept model graph for the concept. Field 1012, field 1014, and field 1016 hold keys corresponding to particular values or instances of the concept represented by the path portion. In this example, field 1012 corresponds to the root of the model graph; field 1014 identifies the particular movie that the movie portion of the path in the MID represents; and field 1016 holds a key representing the particular actor for the concept instance.
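As a rough illustration of the layout described above, a MID can be sketched as a model path plus an ordered tuple of keys. The class name, the "path|k1:k2" string form, and the actor key value 3 are hypothetical; only the path string, the colon-delimited keys, and the root key of zero follow the description.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ModelIdentifier:
    """Hypothetical in-memory form of a MID: a model path plus instance keys."""

    path: str              # structural location in the concept model graph
    keys: Tuple[int, ...]  # root key first, then one key per multivalued concept

    def __str__(self):
        # The '|' separator is an assumption; keys are colon-delimited.
        return self.path + "|" + ":".join(str(k) for k in self.keys)


# Root key 0, a movie key of 10, and an illustrative actor key of 3.
first_name_mid = ModelIdentifier("/MovieDB/Movie/Actor/First Name", (0, 10, 3))
```

Freezing the dataclass makes instances hashable, so they can serve directly as dictionary keys in an index.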

As discussed, in at least one of the various embodiments, mapping processes may be arranged to generate MIDs for the concept instances that may be associated with raw data nodes that are discovered during ingestion. In at least one of the various embodiments, each concept instance in a semantic model may be represented by at least one unique MID. However, in at least one of the various embodiments, values from the same source data may be represented by multiple MIDs in the concept model. For example, MID 1008 represents a MID for an actor's first name that may be included in a source data record. In this example, another mapping process may be arranged to produce a ‘Person’ concept that represents all the persons identified in the source data; accordingly, it may generate a different MID, such as MID 1020, that also refers to the actor's first name (since the actor is also a person). In this example, MID 1020 may be generated from the same source data as MID 1008 but by a different mapping process than the one that produced MID 1008. Accordingly, in at least one of the various embodiments, field 1022 includes the path within the model graph for the concept; field 1024 is a key value for the root of the model graph; and field 1026 is a key that identifies the particular person.

In at least one of the various embodiments, MIDs may be compressed or otherwise transformed to reduce storage size and/or to reduce processing costs. For example, hash 1030 may be generated by hashing MID 1008 to generate a unique hash key that may be used to represent MID 1008. In this example, hash 1030 is generated using the SHA-1 hashing algorithm. In other embodiments, other hashing algorithms and/or compression algorithms may be employed.
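A minimal sketch of reducing a MID to a fixed-size hash key with SHA-1 follows, using Python's standard hashlib module. The way the path and keys are serialized before hashing (the '|' and ':' separators, UTF-8 encoding) and the key values are assumptions; the description only names the hashing algorithm.

```python
import hashlib


def mid_hash(path, keys):
    """Return a 40-character hex SHA-1 digest for a MID.

    The serialization 'path|k1:k2:...' is an illustrative assumption.
    """
    serialized = path + "|" + ":".join(str(k) for k in keys)
    return hashlib.sha1(serialized.encode("utf-8")).hexdigest()


# Hypothetical key values for an actor first-name concept instance.
digest = mid_hash("/MovieDB/Movie/Actor/First Name", (0, 10, 3))
```

The digest has a fixed length regardless of path depth, which is what makes it attractive as a compact index key.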

In at least one of the various embodiments, the path portion of the MID may be represented using one or more numeric encoding schemes for representing a path in a graph. However, for clarity, paths are usually shown herein as expanded strings (e.g., path 1010).

FIGS. 11, 12, and 13 show portions of indices that may be produced by an indexer, such as, indexer 319 and/or a knowledge manager application 320. As discussed above, classifiers and mapping processes are arranged to discover and extract concepts from source data. In at least one of the various embodiments, a mapping process may be arranged to generate one or more MIDs that are associated with instances of the concepts. After mapping, the MIDs may be provided to the indexer application, such as, indexer 319, that may be arranged to index the MIDs based on the content-types of their associated raw data values.

FIG. 11 shows a portion of index 1100 for indexing n-gram valued MIDs in accordance with at least one of the various embodiments. In at least one of the various embodiments, an indexer may be arranged to identify raw data fields that have values that are n-grams. Accordingly, the MIDs associated with these fields may be indexed in an index that is optimized for n-grams.

For example, in at least one of the various embodiments, index 1100 may include various columns, such as, N-gram (column 1102), path (column 1104), key (column 1106), extra data (column 1108), or the like.

In at least one of the various embodiments, column 1102 holds the n-gram values that are associated with the MID in the index. Here, for brevity, only one n-gram is shown associated with each MID. However, in some embodiments, multiple n-gram values may be associated with the same MID. For example, if the value of the concept instance associated with the MID was “mary had a little lamb”, the MID may be associated with the n-grams mary, little, lamb, little lamb, mary had a little lamb, and so on. Also, in at least one of the various embodiments, n-gram index keys (column 1102) may include more than one word; for example, “little lamb,” “little,” and “lamb” may be n-gram keys in the index.
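The expansion of a value into word n-grams can be sketched as follows. This is an illustrative sketch only: it emits every contiguous word sequence, whereas an embodiment might filter some of them (for example, common words); lower-casing and whitespace tokenization are assumptions.

```python
def word_ngrams(text, max_n=None):
    """Return all contiguous word n-grams of text, shortest first."""
    words = text.lower().split()
    if max_n is None:
        max_n = len(words)
    grams = []
    for n in range(1, max_n + 1):
        for start in range(len(words) - n + 1):
            grams.append(" ".join(words[start:start + n]))
    return grams


grams = word_ngrams("mary had a little lamb")
```

For a five-word value this produces 5 + 4 + 3 + 2 + 1 = 15 candidate keys, including "little", "lamb", "little lamb", and the full phrase.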

In at least one of the various embodiments, column 1104 may hold the concept model path that is included in the MID. In at least one of the various embodiments, the keys corresponding to the model graph and the concept instance may be stored in column 1106. In this example, for clarity, the keys are shown in the order they may be applied to the model path, delimited by colons. Accordingly, in at least one of the various embodiments, for MID 1110, the first row in index 1100, after a 0 representing a root node, the next key in column 1106 is 10, which represents an identifier for a particular movie that has been ingested. Likewise, in this example, for MID 1114, column 1106 shows that the movie identified by the key value 20 may be associated with the n-gram ‘nighttime’. This means the word nighttime is part of the title for a movie identified by 20. Note that MID 1114 and MID 1116 both have common path values and the same keys. This is because they represent different properties in the same concept instance. In contrast, MID 1110 and MID 1114 also share the same path information. However, because they represent different concept instances (e.g., different movies) they have different keys.

In at least one of the various embodiments, the path information in column 1104 may be represented in a numerical format such that each portion of a path corresponds to an integer. For example, in at least one of the various embodiments, MovieDB may be assigned a value of 2, Movie may be assigned a value of 8, and Title may be assigned a value of 12. Accordingly, continuing with this example, the path value for row 1110 may be represented as 020812. Likewise, assuming Genre is assigned to correspond with the value 7, the path for row 1112 may be represented as 020807. In at least one of the various embodiments, such numeric values may be employed in the index to facilitate faster indexing as well as more compact data representation of the paths. In some embodiments, each path string may be reduced to a unique string using one or more well-known hashing algorithms. One of ordinary skill in the art will appreciate that other compact/numeric schemes may be employed to represent the paths. The paths are paths in a graph and may be represented using various path representation techniques. It is in the interest of brevity and clarity that they are shown in an expanded string format throughout this document.
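The numeric path representation in this example can be reproduced with a short sketch. It assumes each path portion is written as a two-digit, zero-padded integer and the results are concatenated, which is one reading consistent with the values 020812 and 020807; the function name and fixed width are hypothetical.

```python
# Integer assignments taken from the example above.
SEGMENT_CODES = {"MovieDB": 2, "Movie": 8, "Title": 12, "Genre": 7}


def encode_path(path, width=2):
    """Encode '/A/B/C' as concatenated zero-padded segment codes.

    The fixed two-digit width is an assumption inferred from the example.
    """
    segments = [segment for segment in path.split("/") if segment]
    return "".join(f"{SEGMENT_CODES[segment]:0{width}d}" for segment in segments)


title_code = encode_path("/MovieDB/Movie/Title")
genre_code = encode_path("/MovieDB/Movie/Genre")
```

A fixed width keeps the encoding unambiguous to decode, at the cost of limiting the number of distinct segment codes to 100 for a width of two.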

In at least one of the various embodiments, column 1108 is representative of one or more additional columns that may be included in index 1100. Depending on the type of index, the specific ‘extra data’ columns may vary. For example, in some embodiments, n-gram indices may include extra data related to n-grams, as discussed further below. Likewise, other types of indices may have one or more columns to hold other extra data consistent with the type of index.

In at least one of the various embodiments, values that are semantically equivalent and/or semantically similar to the n-gram(s) associated with a MID may be stored as extra data for an n-gram index. For example, in index 1100, MID 1112 is associated with the n-gram ‘comedy’; thus, its extra data values may include words/n-grams that are semantically equivalent/similar to ‘comedy’. For example, these may include funny, humor, humorous, silly, or the like. In at least one of the various embodiments, semantic equivalents may include words from other languages, such as, Komödie (German), comedia (Spanish), or the like.

In at least one of the various embodiments, extra data may also include whole-part relationships between terms that are indexed. In at least one of the various embodiments, terms that have whole-part relationships with an indexed concept instance value and/or n-gram may be stored in one or more extra data columns. For example, referring back to MID 1112, comedy is the base term in column 1102. Accordingly, terms representing ‘parts’ and/or specializations of the notion of comedy may also be associated with MID 1112, such as, joke, punch-line, stand-up, limerick, or the like. Likewise, in at least one of the various embodiments, broader terms that are inclusive of comedy may be associated with MID 1112, such as, story, entertainment, performance, or the like.

Further, in at least one of the various embodiments, as is common for inverted indices in general, a key value n-gram may be associated with more than one MID. Accordingly, indices such as index 1100 may associate multiple MIDs with a key value based on the source data. For brevity and clarity, associating multiple MIDs with a key is not shown herein.

FIG. 12 shows a portion of index 1200 for indexing MIDs for time-based valued concept instances in accordance with at least one of the various embodiments. In at least one of the various embodiments, classifiers may extract/identify time-based concept instances from source data, such as, birth days, expiration dates, visit dates, release dates, or the like.

In at least one of the various embodiments, time-based indices may be indices that are designed or optimized for indexing time values. The MIDs associated with time values may be indexed based on the time value rather than on the n-grams that may be included in the date-time values. For example, a MID value of ‘Noon, September 18, 2010’ may be indexed using the time value, such as, the Julian Date value of 2455458, rather than being indexed by n-grams, such as ‘noon’, ‘September’, ‘18’, ‘2010’, and so on. Further, in at least one of the various embodiments, different time-based indices may convert time values, such as, time of day, dates, date ranges, durations, or the like, to various index-able date formats, such as, Julian, UNIX time, or the like.
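The Julian Date in this example can be checked with a small conversion sketch. It relies on the fact that Python's proleptic Gregorian ordinal differs from the Julian Date at midnight by a constant 1721424.5; treating the naive datetime as Universal Time is an assumption, and the function name is illustrative.

```python
from datetime import datetime

# Julian Date of midnight on a given day = proleptic Gregorian ordinal + 1721424.5
ORDINAL_TO_JD = 1721424.5


def julian_date(moment):
    """Convert a naive datetime (assumed to be UT) to a Julian Date."""
    seconds = moment.hour * 3600 + moment.minute * 60 + moment.second
    return moment.toordinal() + ORDINAL_TO_JD + seconds / 86400.0


jd = julian_date(datetime(2010, 9, 18, 12, 0, 0))  # 2455458.0
```

Because Julian Dates roll over at noon, the value for noon on September 18, 2010 is a whole number, matching the figure quoted above.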

For example, in at least one of the various embodiments, index 1200 may include columns similar to those described for index 1100. Columns 1204 (Path), 1208 (Keys), and 1210 (extra data) may be considered the same as their counterparts in index 1100; accordingly, a detailed description is not included here. In at least one of the various embodiments, the path information in column 1204 may be represented in a numerical format similar to that described above for index 1100.

In at least one of the various embodiments, column 1202 (time) represents the date-time value of the concept instance represented in a format suitable for time-based indexing. In this example, the time values for MIDs as shown in column 1208 are converted to a numeric date value (Julian Date). One of ordinary skill will appreciate that other time formats may be used depending on the requirements of the time-based index that is being used.

In at least one of the various embodiments, extra data for MIDs in index 1200 may include additional time-based information that may be associated with the MID. For example, if the time value of a MID is close to a holiday or other day/time of significance, it may be indicated in one or more extra data columns.

Further, in at least one of the various embodiments, as is common for inverted indices in general, a key temporal value may be associated with more than one MID. Accordingly, indices such as index 1200 may associate multiple MIDs with a key value based on the source data. For brevity and clarity, associating multiple MIDs with a key is not shown herein.

FIG. 13 shows a portion of index 1300 for indexing geo-spatial valued MIDs in accordance with at least one of the various embodiments. Accordingly, in at least one of the various embodiments, MIDs that include geo-spatial values may be indexed in indices that may be optimized for geo-spatial information.

Except for the geo-spatial fields, in at least one of the various embodiments, index 1300 may include columns similar to those described for index 1100 and index 1200. Columns 1304 (Path), 1306 (Keys), and 1308 (extra data) may be considered the same as their counterparts in index 1100 and index 1200; accordingly, a detailed description is not included here. In at least one of the various embodiments, the path information in column 1304 may be represented in a numerical format similar to that described above for index 1100.

In at least one of the various embodiments, geo-spatial values for concept instances discovered by various classifiers may be arranged and/or converted into various formats that may be compatible with indexing geo-spatial information, such as, latitude/longitude coordinates, polygon information, or the like. In this example, column 1304 represents the geo-spatial information for indexing. For example, MID 1310 represents a concept instance that is a location (Sydney, Australia) where a movie first premiered. Accordingly, rather than index the MID using the n-grams Sydney and Australia, the MID may be indexed based on its GPS coordinates, or latitude and longitude.

In at least one of the various embodiments, extra data information for geo-spatial indices may include additional geo-spatial information that may be associated with the concept instance, such as, altitude, terrain type, other GIS information, or the like.

FIG. 14 illustrates a logical representation of the modeling process for a semantic modeling system 1400 in accordance with at least one of the various embodiments. In at least one of the various embodiments, an ingestion engine, such as, ingestion engine 1402 may be arranged to receive source data from one or more sources (as described above). Ingestion engine 1402 performs actions for parsing the source data and generating a raw data graph, such as raw data graph 1404. In at least one of the various embodiments, raw data graph 1404 may be a graph representation of the structure of the source data.

In at least one of the various embodiments, a mapping engine, such as, mapping engine 1406 may be arranged to map nodes and fields from the raw data graph to a concept graph, such as, concept graph 1408. In at least one of the various embodiments, mapping engine 1406 may be arranged to perform automatic mapping as well as to facilitate user curation actions.

In at least one of the various embodiments, concept graph 1408 may be arranged to represent one or more ontologies. Accordingly, the concepts and relationships in the ontologies may be associated with nodes and fields in the raw data graph. In at least one of the various embodiments, concept graph 1408 may be comprised of portions of one or more known and/or pre-defined ontologies that may be stored in an ontology data store, such as, ontology data store 1410. For example, graph 1412, graph 1414, and graph 1416 represent graphs for one or more ontologies that may be available.

In at least one of the various embodiments, concept graphs, such as, concept graph 1408 may represent a single or whole pre-defined ontology. Also, in some embodiments, concept graph 1408 may be customized for a particular application, and so on.

Further, in at least one of the various embodiments, as is common for inverted indices in general, a key geographic/spatial value may be associated with more than one MID. Accordingly, indices such as index 1300 may associate multiple MIDs with a key value based on the source data. For brevity and clarity, associating multiple MIDs with a key is not shown herein.

FIG. 15 shows a logical representation of mapping raw data to concepts for modeling system 1500 in accordance with at least one of the various embodiments. As described above, an ingestion engine (not shown) may be arranged to process provided source data to generate a raw data graph, such as, raw data graph 1502. Likewise, as discussed above, a mapping engine (not shown) may be arranged to perform actions for mapping raw data nodes and fields to concepts and/or concept properties that comprise a concept graph, such as concept graph 1504.

In at least one of the various embodiments, nodes of a raw data graph, such as, raw data graph 1502 may be arranged into namespaces, such as, namespace 1506, schema nodes 1508, and fields 1510. For example, if the source data was a database file, namespace 1506 may include nodes representing the names of the databases included in the file, such as, Movies, Accounting, Medical Charts, or the like. Likewise, for this example of an ingested database file, schema nodes 1508 may represent tables in the database. And, fields 1510 may represent columns of each table.
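For the database-file case, the arrangement of namespace, schema nodes, and fields can be sketched with Python's built-in sqlite3 module. The nested-dictionary representation, the function name, and the example tables are assumptions made for illustration; an actual raw data graph would carry additional annotations.

```python
import sqlite3


def raw_data_graph(connection, namespace):
    """Build {namespace: {table: [column, ...]}} from a SQLite schema."""
    graph = {namespace: {}}
    tables = connection.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # Each PRAGMA table_info row is (cid, name, type, notnull, dflt_value, pk).
        columns = connection.execute(f'PRAGMA table_info("{table}")').fetchall()
        graph[namespace][table] = [column[1] for column in columns]
    return graph


# Hypothetical in-memory database standing in for an ingested database file.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE movie (id INTEGER PRIMARY KEY, title TEXT, release_date TEXT)")
connection.execute("CREATE TABLE actor (id INTEGER PRIMARY KEY, movie_id INTEGER, first_name TEXT)")
graph = raw_data_graph(connection, "Movies")
```

Only the structure is captured here: no rows are read, which mirrors the point that the raw data graph represents the shape of the source data rather than its records.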

In some embodiments, if the source data is from an XML file, the structure and shape of the raw data graph (namespace, nodes, and fields) may be defined by the structure of the XML file. For data sources such as system log files, the schema nodes may be arranged based on log record type, and so on.

In at least one of the various embodiments, concept graph 1504 may be arranged in namespace 1512, concept nodes 1514, and concept properties 1516. In at least one of the various embodiments, the namespace, concept nodes, concept properties, or the like, may be determined based on an ontology for one or more data domains.

In at least one of the various embodiments, a user may define the concept graph by combining portions of one or more existing ontologies. Also, a concept graph may be custom defined for a particular application. In at least one of the various embodiments, concept graph 1504 may be considered to be the structure of a model rather than the data and/or contents of the ingested source data. Likewise, the raw data graph represents the structure of the ingested source data rather than the actual records.

In at least one of the various embodiments, as described above, one or more classifiers may be arranged to perform actions to augment and/or reshape ingested data. Accordingly, classifiers may be configured to generate schema nodes and/or fields in the raw data graph to represent features that may not be readily and/or inherently visible/present in the source data. For example, if a semantic modeling system is employed to ingest a large database of patient medical records, it may be of value to define a field that indicates if a patient has ever had cancer. However, a field corresponding to “having cancer” may be represented multiple ways in any given patient's clinical record, because the patient record may indicate the presence of cancer by using the precise medical terminology to identify the disease/condition, rather than a binary indicator that the patient has cancer.

Accordingly, for example, a classifier may be arranged to generate a field that indicates whether the patient has ever been diagnosed with cancer. In this example, in at least one of the various embodiments, to accomplish this a classifier, such as, classifier 510 in FIG. 5, may be arranged to determine, during the ingestion of a clinical patient record, if a person has been diagnosed with cancer. In this example, the classifier may be arranged to scan the source data record (the patient record) for information that indicates that the patient has cancer. For example, the classifier may scan the patient diagnoses record in the patient record to determine if there are matches to one or more of the dozens of different types of known cancers. If the classifier finds a match, a binary field in the raw data graph may be set accordingly.
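A classifier of this kind can be sketched as a simple term match over a record's diagnoses. The term list below is a short illustrative sample and is not exhaustive, and the field name 'has_had_cancer' and record layout are hypothetical; a real classifier would draw on a proper clinical vocabulary.

```python
# Illustrative, non-exhaustive sample of cancer-related terms (an assumption).
CANCER_TERMS = ("carcinoma", "melanoma", "lymphoma", "leukemia", "sarcoma")


def has_cancer_diagnosis(diagnoses):
    """Return True if any diagnosis text mentions a known cancer term."""
    return any(
        term in diagnosis.lower()
        for diagnosis in diagnoses
        for term in CANCER_TERMS
    )


def classify_patient_record(record):
    """Return a copy of the record with a derived binary field added."""
    augmented = dict(record)
    augmented["has_had_cancer"] = has_cancer_diagnosis(record.get("diagnoses", []))
    return augmented


# Hypothetical patient record.
patient = classify_patient_record(
    {"patient_id": 1, "diagnoses": ["Basal cell carcinoma", "Hypertension"]}
)
```

The derived field does not exist in the source record; it is the kind of feature a classifier may add to the raw data graph during ingestion.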

Also, in at least one of the various embodiments, classifiers may be arranged to determine various features of the fields in the raw data graph. For example, as the source data is being ingested, one or more registered classifiers may analyze the source data to determine if the field represents an email address, date, time, first name, last name, street address, telephone number, IP address, URL, or the like, or combination thereof. This feature information may be stored in the corresponding field nodes of the raw data graph.
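Feature detection for a few of these types can be sketched with the standard library. The label names, the deliberately loose email pattern, and the majority-vote rule for labeling a whole field are assumptions chosen for illustration, not a description of any particular classifier.

```python
import ipaddress
import re
from collections import Counter

# Deliberately loose pattern: something@something.something (an assumption).
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def classify_value(value):
    """Return a hypothetical feature label for a single field value."""
    try:
        ipaddress.ip_address(value)
        return "ip_address"
    except ValueError:
        pass
    if EMAIL_PATTERN.match(value):
        return "email_address"
    if value.startswith(("http://", "https://")):
        return "url"
    return "unknown"


def classify_field(values):
    """Label a field with the most common label among its sampled values."""
    counts = Counter(classify_value(value) for value in values)
    return counts.most_common(1)[0][0] if counts else "unknown"


field_label = classify_field(["a@example.com", "b@example.org", "not an email"])
```

The resulting label is the kind of annotation that could be stored on the corresponding field node.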

In at least one of the various embodiments, classifiers may be arranged to perform more specialized feature recognition, such as, disambiguation of data types. For example, the corpus of source data may include one or more names, acronyms, values, or the like, that may refer to different concepts or ideas. Accordingly, a classifier may be arranged to perform extended analysis to attempt to disambiguate terms that have different meanings depending on the context of their use.

For example, the acronym MPH could refer to a rate of speed (miles-per-hour) or an education credential (Master of Public Health). In this example, a classifier may be arranged to look in the text surrounding the ambiguous term in the source data record for indications of the meaning. For example, if a number precedes the MPH, it may be more likely that the term refers to miles-per-hour rather than Master of Public Health. In at least one of the various embodiments, nodes and/or fields in the raw data graph may be annotated with the disambiguation information accordingly.
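The number-precedes-MPH heuristic can be sketched as follows. Only that one rule comes from the description; labeling every other occurrence as "unresolved" rather than guessing, and the label strings themselves, are assumptions.

```python
import re


def disambiguate_mph(text):
    """Label each occurrence of 'MPH' using the preceding-number heuristic."""
    labels = []
    for match in re.finditer(r"\bMPH\b", text, flags=re.IGNORECASE):
        preceding = text[:match.start()].rstrip()
        if re.search(r"\d$", preceding):
            labels.append("miles-per-hour")
        else:
            # No further context rules are modeled in this sketch.
            labels.append("unresolved")
    return labels


speed_labels = disambiguate_mph("The vehicle was traveling at 65 MPH.")
degree_labels = disambiguate_mph("Jane Doe, MPH, led the study.")
```

A fuller classifier would weigh several context signals; this sketch only shows where such a rule would sit.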

In at least one of the various embodiments, a mapping engine may be arranged to perform actions to map nodes and fields of the raw data graph to the concept graph. In some cases, the mapping engine may be enabled to perform automatic mapping based on configuration information and/or rules defined for the concept graph. In other cases, the mapping engine may present a user with a list of fields that may be likely candidates for mapping to particular concepts and/or concept properties in the concept graph.

In at least one of the various embodiments, in FIG. 15 examples of such mappings are indicated by double ended arrows: mapping 1528, mapping 1530, mapping 1532, mapping 1534, and mapping 1536. In general, fields from the raw data graph may be mapped to properties in the concept graph. In at least one of the various embodiments, the mapping engine may selectively map a portion of the fields from a raw data graph node to a concept node. For example, two fields from raw node 1520 are mapped to properties of concept node 1522. Likewise, for example, two fields from raw node 1518 are mapped to properties of concept node 1524 while one field of raw node 1518 is mapped to a property of concept node 1526.

In at least one of the various embodiments, mapping rules may include references to one or more annotations in the raw data graph that may have been generated by the classifiers. For example, a mapping rule may map raw data nodes that have fields, such as, first name, last name, address, and telephone number, or the like, to a person concept in the concept graph. Likewise, for movie data, if a raw data node includes fields such as title and release date, the raw node may be mapped to a movie concept node in the concept graph.
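Rules of this kind can be sketched as required field sets per concept. The rule table, the requirement that all listed fields be present, and the function name are assumptions for illustration; the description does not fix how a rule decides a match.

```python
# Hypothetical rule table: concept name -> field annotations it requires.
MAPPING_RULES = {
    "Person": {"first name", "last name", "address", "telephone number"},
    "Movie": {"title", "release date"},
}


def candidate_concepts(raw_node_fields):
    """Return the concepts whose required fields are all present on a raw node."""
    present = set(raw_node_fields)
    return [
        concept
        for concept, required in MAPPING_RULES.items()
        if required <= present
    ]


movie_candidates = candidate_concepts(["title", "release date", "budget"])
```

The returned list could either drive automatic mapping or be presented to a user as likely candidates for curation.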

Further, in some embodiments, ingested source data may include relationship information such as joins. The ingestion engine may be arranged to recognize joins based on foreign keys in the source data. Accordingly, in some embodiments, the ingestion engine may add a join edge, such as, edge 1538, to represent the join. In at least one of the various embodiments, the edge may be annotated with meta-data to indicate the direction and/or cardinality information for the join.

FIG. 16 shows a portion of forward index 1600 in accordance with at least one of the various embodiments. In at least one of the various embodiments, an indexer, such as, indexer 319, may be arranged to generate one or more indexes that may be employed for associating field values of the raw data graph with MIDs. As described above, a MID may be comprised of a path and keys. In index 1600, column 1602 represents a column for holding MID path information and column 1604 represents a column in the index for holding the keys information for a MID. And, in at least one of the various embodiments, column 1606 of index 1600 holds the field value that is associated with the MID.

Accordingly, in at least one of the various embodiments, row 1608 of index 1600 includes data corresponding to a movie title. The information in row 1608's path column (“/MovieDB/Movie/Title”) describes the represented concept in terms of its location in the model graph. The keys column of row 1608 holds values representing the key to identify a particular entity for each variable portion of the path. And, the value column of row 1608 holds the actual value of the concept instance taken from the raw data (e.g., source data). In at least one of the various embodiments, the path information in column 1602 may be represented in a numerical format similar to that described above for index 1100. In at least one of the various embodiments, such numeric values may be employed in the index to facilitate faster indexing as well as more compact data representation of the paths.

In at least one of the various embodiments, index 1600, and others like it, may be employed to quickly determine the source value that is associated with a particular MID. Thus, for example, indices, such as, index 1100 may be used to look up MIDs given one or more search terms. And, index 1600 may be employed to determine the raw data values that are associated with the MIDs.
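The two-step lookup, from search term to MIDs and from MIDs to source values, can be sketched with two dictionaries. The dictionary layout and the title text "Nighttime Example" are hypothetical; the n-gram 'nighttime', the path, and the movie key 20 follow the earlier index example.

```python
# Inverted n-gram index: search term -> list of MIDs, each a (path, keys) pair.
ngram_index = {
    "nighttime": [("/MovieDB/Movie/Title", "0:20")],
}

# Forward index: MID -> source value. The title text is a made-up placeholder.
forward_index = {
    ("/MovieDB/Movie/Title", "0:20"): "Nighttime Example",
}


def search(term):
    """Resolve a search term to (MID, source value) pairs via both indices."""
    mids = ngram_index.get(term.lower(), [])
    return [(mid, forward_index[mid]) for mid in mids]


results = search("Nighttime")
```

Keeping the two structures separate lets each be optimized independently: the inverted index for matching terms, the forward index for retrieving values.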

Generalized Operation

FIGS. 17-23 represent the generalized operation for dynamic semantic models using multiple indices in accordance with at least one of the various embodiments. In at least one of the various embodiments, processes 1700, 1800, 1900, 2000, 2100, 2200, and 2300 described in conjunction with FIGS. 17-23 may be implemented by and/or executed on a single network computer, such as network computer 300 of FIG. 3. In other embodiments, these processes, or portions thereof, may be implemented by and/or executed on a plurality of network computers, such as network computer 300 of FIG. 3. In yet other embodiments, these processes, or portions thereof, may be implemented by and/or executed on one or more virtualized computers, such as, those in a cloud-based environment. However, embodiments are not so limited and various combinations of network computers, client computers, or the like may be utilized. Further, in at least one of the various embodiments, the processes described in conjunction with FIGS. 17-23 may be operative in semantic modeling systems and/or architectures such as those described in conjunction with FIGS. 4-16.

FIG. 17 shows an overview flow for process 1700 for generating dynamic semantic models having multiple indices in accordance with at least one of the various embodiments. After a start block, at block 1702, source data may be provided to an ingestion engine. As described above, the ingestion engine may be arranged to process source data provided in a variety of forms and formats. In at least one of the various embodiments, source data may be provided by way of an API. Also, in at least one of the various embodiments, an API may be employed by users or other processes to provide information for obtaining the source data (e.g., links, file system information, or the like). In some embodiments, the API may be implemented as a library, as a Representational State Transfer (REST) API, remote procedure calls (RPC), or the like, or combination thereof.

At block 1704, in at least one of the various embodiments, the ingestion engine may be arranged to generate a raw data graph that represents the structure of the ingested source data. In at least one of the various embodiments, raw data graphs may include schema nodes based on the structure of the source data as well as fields that represent the features for the schema nodes. For example, if the source data is a database, the schema nodes may correspond to tables in the database and the fields may correspond to columns of the tables.
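As a non-limiting sketch of the database example above, a raw data graph might be represented with one schema node per table and one field per column. The class names `SchemaNode` and `Field`, the `annotations` attribute, and the input shape (a mapping of table names to column names) are hypothetical and chosen only for this illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Field:
    """A raw data field, e.g. one column of a table."""
    name: str
    annotations: dict = field(default_factory=dict)  # filled in later by classifiers


@dataclass
class SchemaNode:
    """A raw data schema node, e.g. one table of a database."""
    name: str
    fields: list = field(default_factory=list)


def build_raw_data_graph(tables):
    """Build schema nodes from a mapping of table name -> list of column names."""
    return [
        SchemaNode(name=table, fields=[Field(column) for column in columns])
        for table, columns in tables.items()
    ]


graph = build_raw_data_graph({
    "patients": ["id", "name"],
    "visits": ["patient_id", "date"],
})
```

The structure of the graph follows the structure of the source data; no concept-level meaning is attached at this stage.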

In at least one of the various embodiments, one or more classifiers may be registered to process and/or analyze the source data as it is ingested. Classifiers may determine one or more attributes of the raw data nodes and fields. Appropriate annotations may be added to the raw data nodes and/or fields to represent the discovered attributes. Also, in some cases, classifiers may produce new fields that may be added to the raw data graph, as described above.

At block 1706, in at least one of the various embodiments, process 1700 may be arranged to generate multiple indices. In at least one of the various embodiments, indices may be generated during the ingestion process. Also, in at least one of the various embodiments, indices may be refined and/or generated after the mapping between the raw data graph and the concept graph is complete. Indices generated after the mapping may include the various MID indices that associate paths in the concept graph with source data records.

In at least one of the various embodiments, the generation and refinement of the indices may be an ongoing process. As users observe the raw data graph and work with the model graphs, they may identify one or more refinements that may be made. Also, as a result of query results, the system may be arranged to introduce index information automatically. In at least one of the various embodiments, a user or other process may be enabled to generate refinements by interacting with the system over an API. In at least one of the various embodiments, the API may be implemented as a library, as a Representational State Transfer (REST) API, remote procedure calls (RPC), or the like, or combination thereof.

At block 1708, in at least one of the various embodiments, a concept graph and/or concept model may be determined. In at least one of the various embodiments, as discussed above, concept graphs include concept nodes, concept properties, and the relationships between them. A particular concept graph may be selected from a collection of available concept graphs. Or, in some embodiments, concept graphs may be created for a particular application. Further still, a concept graph may be generated from portions of existing concept graphs. As discussed above, a concept graph may be arranged to correspond to ontologies used for modeling the systems that are being modeled.

In at least one of the various embodiments, the concept graph may be selected/determined based on predefined configuration information that is established before the source data is ingested.

For example, if a user knows that the source data is patient records, a concept graph for a medical patient ontology may be selected before ingestion of the source data.

At block 1710, in at least one of the various embodiments, a mapping engine may map the raw data nodes and/or fields to concept nodes and concept properties. In at least one of the various embodiments, the mapping engine may be arranged to include rules for automatically determining mappings between the raw data graph and the concept graph. Also, in at least one of the various embodiments, the mapping engine may enable users to manually map raw data information to the concept graph. In some embodiments, the mapping engine may identify candidates (nodes and/or fields) in the raw data graph for mapping to the concept graph. Such candidates may be based on rules that are employed by the mapping engine.

At block 1712, optionally, in some embodiments, a user may be enabled to manually curate the mapping information. Accordingly, the user may be enabled to establish, modify, and/or remove mappings between raw data graph information and the concept graph. In at least one of the various embodiments, the user may be enabled to curate the mapping information using a graphical user interface, command-line interface, configuration files, or the like, or combination thereof.

In at least one of the various embodiments, a user may be enabled to curate mapping information using a client application that interacts with process 1700 over an API. In at least one of the various embodiments, the API may be implemented as a library, as a Representational State Transfer (REST) API, remote procedure calls (RPC), or the like, or combination thereof.

At decision block 1714, in at least one of the various embodiments, if the indices may be further refined, control may loop back to block 1706; otherwise, control may flow to block 1716. In at least one of the various embodiments, interaction of users with the concept model and/or the raw data graph may indicate that one or more indices may be refined. In at least one of the various embodiments, during a curation session, a user may identify raw data fields that may be incorrectly associated with a particular concept. For example, in at least one of the various embodiments, a user may discover that values that initially were identified as social security numbers are actually proprietary health provider identifiers. In such cases, a user may make a refinement to associate the value with a more accurate concept. In some embodiments, refinements may result in one or more indices being updated or modified.

In at least one of the various embodiments, a user or other process may be enabled to generate refinements by interacting with the system over an API. In at least one of the various embodiments, the API may be implemented as a library, as a Representational State Transfer (REST) API, remote procedure calls (RPC), or the like, or combination thereof.

At block 1716, in at least one of the various embodiments, if the source data is ingested and the raw data graph is mapped to the concept graph, the system may be considered ready for processing queries and/or searches.

At decision block 1718, in at least one of the various embodiments, if refinement of indices is needed, control may loop back to block 1706; otherwise, control may be returned to a calling process. In at least one of the various embodiments, results of a query and/or the interaction of users with the results of queries may result in refinements to the concept model. Users may explicitly manipulate the results by grouping, sorting, selecting, or the like. Or, in at least one of the various embodiments, the process may monitor how a user reacts to results to implicitly determine refinements to the indices. In at least one of the various embodiments, a user or other process may be enabled to generate refinements by interacting with the system over an API. In at least one of the various embodiments, the API may be implemented as a library, as a Representational State Transfer (REST) API, remote procedure calls (RPC), or the like, or combination thereof.

FIG. 18 shows an overview flowchart for process 1800 for data ingestion in accordance with at least one of the various embodiments. After a start block, at block 1802, source data may be provided to an ingestion engine. In at least one of the various embodiments, source data may be provided by one or more source data server computers, such as source data server computer 118. In at least one of the various embodiments, source data may be provided in the form of documents/records from file systems, archives, databases, or the like. Also, in at least one of the various embodiments, source data may be provided from a continuous streaming source, such as audio, video, log streams, event streams, or the like.

At block 1804, in at least one of the various embodiments, the ingestion engine may generate a payload that may provide a common format for processing the source data. The provided source data may be added to the generated payload. In at least one of the various embodiments, the common format payload may be arranged to provide a normalized data structure and/or interface for accessing the source data. In at least one of the various embodiments, classifiers may be arranged to rely on the common format of the payload during ingestion.

At block 1806, in at least one of the various embodiments, the payload may be provided to each classifier that is registered with the ingestion engine. In at least one of the various embodiments, as discussed above, there may be one or more classifiers, each arranged to perform different analysis of the payload and/or source data. Configuration information that is accessed by the ingestion engine may include a list of one or more classifiers to which the payload may be provided. In some embodiments, one or more of the classifiers may be serially provided the payload according to a rank order, or prioritization. In other embodiments, one or more of the classifiers may be provided the payload in parallel.

In at least one of the various embodiments, some classifiers may be arranged to format and/or prepare the source data for inclusion in the payload. Also, some classifiers may be arranged to generate meta-data, such as record type, content-type, source, age/date, owner, disambiguation information, or the like, to include in the payload. Other classifiers may be provided to identify non-obvious/hidden features from the source data.

At block 1808, in at least one of the various embodiments, the information included in the payload may be employed for generating schema nodes and fields for the raw data graph.

At decision block 1810, in at least one of the various embodiments, if more source data is available, control may loop back to block 1802; otherwise, the ingestion process may be complete and control may be returned to a calling process.

FIG. 19 shows an overview flowchart for process 1900 for ingesting source data for a dynamic semantic model in accordance with at least one of the various embodiments. After a start block, at block 1902, a payload may be provided to each classifier that may be registered with the ingestion engine. In at least one of the various embodiments, the payload may be a data structure object that includes the unprocessed source data as well as annotation information that may have been added by one or more previously executing classifiers. In at least one of the various embodiments, classifiers may be registered with the ingestion engine by a user and/or configuration information. In at least one of the various embodiments, some classifiers may be built-in system level classifiers that may be arranged to perform system tasks such as adding timestamps, identifiers, or the like, to set up the payload.

At block 1904, in at least one of the various embodiments, as a classifier is provided a payload it may perform actions to identify features in the source data.

In at least one of the various embodiments, classifiers may be arranged to discover and/or extract feature information from the source data and/or the payload itself. In some embodiments, one or more classifiers may be specifically designed to process particular types of source data. These classifiers may be looking for particular fields and/or patterns in the source data that may be identified as features.

In at least one of the various embodiments, classifiers may be arranged to perform an initial operation to determine if the payload includes information that may be relevant to them. Accordingly, in some embodiments, classifiers may be arranged to test values in the payload meta-data, such as record type, content-type, source, age/date, owner, or the like, to determine if the classifier may further process the data. In at least one of the various embodiments, a classifier that may be arranged to process a source record from a particular data source, such as a particular patient/clinical record database, may accept or decline an invitation to process the payload based on the values of one or more meta-data values. Likewise, in at least one of the various embodiments, a classifier may be designed to process older source records (e.g., that may be provided in an older format). Accordingly, such a classifier may be arranged to accept records that may be older than a defined date and decline records that may be newer than the defined date.

At block 1906, in at least one of the various embodiments, one or more actions performed by a classifier may produce information that may be added to the payload. In at least one of the various embodiments, classifiers that discover and extract one or more features from the source data may add them to the payload.

In at least one of the various embodiments, information added to the payload may be available to other classifiers that may be subsequently provided the payload for processing. Thus, in at least one of the various embodiments, features discovered by classifiers based on the current payload may also be added to the payload.

At decision block 1908, in at least one of the various embodiments, if there are more classifiers available to process the payload, control may loop back to block 1902; otherwise, control may flow to block 1910.

At block 1910, in at least one of the various embodiments, since all the registered classifiers have had an opportunity to process the payload, the payload may be provided to an indexer, such as indexer 319. In at least one of the various embodiments, the payload provided to the indexer may include the information that may have been added to the payload by the classifiers. The indexer may generate the raw data graph from the information in the payload. The feature information that was determined and/or discovered by the classifiers may be added to elements of the raw data graph as annotations to provide more information about the graph element. Next, control may be returned to a calling process.
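By way of a hedged illustration, the loop of process 1900 might be sketched as a serial pass over rank-ordered classifiers, each of which may accept or decline the payload based on its meta-data and may see features added by earlier classifiers. The `Classifier` class, the payload keys (`source`, `meta`, `features`), and the cutoff date are assumptions introduced only for this sketch.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Classifier:
    name: str
    rank: int                          # lower rank runs first
    accepts: Callable[[dict], bool]    # tests payload meta-data
    extract: Callable[[dict], dict]    # returns features to add to the payload


def run_classifiers(payload, classifiers):
    """Provide the payload to each classifier in rank order."""
    for clf in sorted(classifiers, key=lambda c: c.rank):
        if clf.accepts(payload["meta"]):
            payload["features"].update(clf.extract(payload))
    return payload


CUTOFF = "2010-01-01"  # hypothetical defined date; ISO dates compare as strings

record_type = Classifier(
    "record_type", 0,
    accepts=lambda meta: True,
    extract=lambda p: {"record_type": "patient" if "diagnoses" in p["source"] else "unknown"},
)
legacy_format = Classifier(
    "legacy_format", 1,
    accepts=lambda meta: meta["date"] < CUTOFF,  # only records older than the cutoff
    extract=lambda p: {"legacy": True, "seen_record_type": p["features"].get("record_type")},
)

old = run_classifiers(
    {"source": {"diagnoses": ["asthma"]}, "meta": {"date": "2009-05-01"}, "features": {}},
    [legacy_format, record_type],
)
new = run_classifiers(
    {"source": {"diagnoses": ["asthma"]}, "meta": {"date": "2014-03-15"}, "features": {}},
    [legacy_format, record_type],
)
```

Here the second classifier reads a feature produced by the first, and it declines the newer record entirely. A parallel arrangement, which the embodiments also contemplate, is not shown.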

FIG. 20 shows an overview flowchart for process 2000 for performing actions to classify information and discover features in the source data for a dynamic semantic model in accordance with at least one of the various embodiments. After a start block, at block 2002, a classifier may be determined for analyzing the raw data. In at least one of the various embodiments, as described above, one or more classifiers may be registered with a classification engine. In at least one of the various embodiments, classifiers may execute in their order of registration. In some embodiments, other configuration information and/or rule based policies may be employed to determine which classifier to execute.

At block 2004, in at least one of the various embodiments, the classifier may be arranged to examine the raw data graph and the source content that is associated with the raw data graph. In at least one of the various embodiments, the raw data graph elements may be arranged to include meta-data that may indicate to the classifier how the information in the payload should be processed. In at least one of the various embodiments, the raw data element may include meta-data associated with its corresponding source data. In at least one of the various embodiments, such meta-data may include an identity of the source of the record, record format information, ownership information, creation date, modification date, language, or the like.

In at least one of the various embodiments, one or more classifiers may be arranged to process source data that may be in particular formats. For example, in at least one of the various embodiments, some classifiers may be arranged to process text files while others may be arranged to process binary data, such as images, videos, or the like. Likewise, in at least one of the various embodiments, some classifiers may be designed for processing source records from a particular data source. For example, in at least one of the various embodiments, it may be known in advance that source data from a particular source includes information and/or formatting that may be unique to that source. Accordingly, one or more classifiers may be arranged to process the source data having information and/or formatting that may be unique to that source. Likewise, in at least one of the various embodiments, some classifiers may be arranged to ignore source data from particular data sources. In at least one of the various embodiments, one or more classifiers may be arranged to generate the meta-data used by subsequent classifiers. In at least one of the various embodiments, there may be one or more built-in classifiers that may be arranged to process all incoming source data to produce the meta-data that subsequent classifiers may use.

In at least one of the various embodiments, a classifier may examine the source data that is associated with a raw data graph element to extract and/or discover feature information in the source data record. In at least one of the various embodiments, a classifier may be arranged to examine the source data to identify patterns of information that may be associated with one or more features of the source data.

In at least one of the various embodiments, the particular actions performed by each classifier may depend on the format of the source data. Likewise, if a classifier arranged to process one or more particular data formats determines that the source data is in an unsupported format, the classifier may abort its processing.

For example, if the source data is known to be an XML file, the classifier may be arranged to process XML. In at least one of the various embodiments, the classifier may have access to a Document Type Definition (DTD) or other mechanism for validating the XML of the source data. In other embodiments, the classifier may employ pattern matching for finding particular labels, attribute names, or the like included in the XML file rather than being limited to a DTD.

In at least one of the various embodiments, some classifiers may be arranged to recognize data in multiple formats. For example, in at least one of the various embodiments, a single classifier may be arranged to process XML formatted information as well as JSON formatted information.

In at least one of the various embodiments, classifiers may be arranged to identify and/or discover a single feature in the source data. Also, in at least one of the various embodiments, the classifier may refer to feature information that may have been previously added to the raw data element by other classifiers.

In at least one of the various embodiments, one or more classifiers may be arranged to perform actions to augment and/or reshape ingested source data. Accordingly, classifiers may be configured to generate concepts/concept instances comprising features/fields that may not be readily and/or inherently visible in the source data. For example, if a semantic modeling system is employed to ingest a large database of patient medical records, it may be advantageous to define a field that indicates if a patient has ever had cancer. However, the attribute of “having cancer” may be represented multiple ways in any given patient's clinical record. This is because the patient record may indicate the presence of cancer by using precise medical terminology to identify the disease/condition, rather than a binary indicator that the patient has cancer.

Accordingly, for example, a classifier may be arranged to generate feature information that may indicate whether the patient has ever been diagnosed with cancer. In this example, in at least one of the various embodiments, to accomplish this a classifier may be arranged to determine from ingesting a clinical patient record if a person has been diagnosed with cancer. In this example, the classifier may be arranged to scan the source data record (the patient record) for information that indicates that the patient has cancer. For example, the classifier may scan the patient diagnoses to determine if there are matches to one or more of the dozens of different types of known cancers. If the classifier finds a match, a field in the raw data graph corresponding to the patient “having cancer” may be set to a value of ‘yes’. If the classifier does not find a match, the value corresponding to the patient “having cancer” may be set to ‘no’.

In at least one of the various embodiments, this type of augmentation information may be added to the raw data graph during the classification process as if it were a piece of information that was included in the source data.
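As a minimal sketch of the “having cancer” example, and not as a clinically meaningful implementation, an augmenting classifier might scan the diagnoses of a record for known terms and add a derived yes/no field. The term list, the record layout, and the field name `having_cancer` are hypothetical and deliberately incomplete.

```python
# Illustrative term list only; a real classifier would need a far more complete vocabulary.
CANCER_TERMS = ("carcinoma", "lymphoma", "leukemia", "melanoma", "sarcoma")


def has_cancer(diagnoses):
    """Return 'yes' if any diagnosis text mentions a known cancer term, else 'no'."""
    text = " ".join(diagnoses).lower()
    return "yes" if any(term in text for term in CANCER_TERMS) else "no"


def augment(record):
    """Return a copy of the record with the derived 'having_cancer' field added."""
    fields = dict(record)
    fields["having_cancer"] = has_cancer(record.get("diagnoses", []))
    return fields
```

The derived field is added alongside the original fields, so downstream mapping and indexing can treat it as if it had been present in the source data.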

At block 2006, in at least one of the various embodiments, optionally, a classifier may be arranged to employ one or more external information sources to process the raw data graph elements and/or the source data. In at least one of the various embodiments, a classifier may be arranged to communicate with one or more external databases or other information services. Such communication may be employed for confirming one or more characteristics of data that may be discovered in the source data.

In at least one of the various embodiments, a classifier may communicate with an external information source to confirm that a discovered identifier corresponds to a particular feature. For example, a classifier may be arranged to confirm that certain nine-digit strings may be associated with an employee, customer, patient, or the like.

Further, in at least one of the various embodiments, a classifier may be arranged to communicate with external information sources to obtain additional information. For example, if a classifier is arranged to discover and extract features related to an employee identifier, it may also be arranged to communicate with an external database to obtain more information about the employee. Some or all of the information provided by the external information source may be added to the raw data graph.

At decision block 2008, in at least one of the various embodiments, if the classifier has discovered and/or extracted feature information, control may flow to block 2010; otherwise, control may be returned to a calling process.

At block 2010, in at least one of the various embodiments, some or all of the feature information discovered and/or extracted by the classifier may be added to the raw data graph. In at least one of the various embodiments, features and/or information discovered during classification may result in modification to the concept graph. Accordingly, classification may determine additional properties of a raw data field that indicate that it is or is not associated with a particular concept. For example, during classification, if a string value initially interpreted as a person's name is reclassified as a business name, this may cause the raw data to be associated with a different concept, such as a company rather than an employee or person. Further, if additional properties/features are added to the raw data graph based on classification, one or more indices may be updated to incorporate that information. Next, control may be returned to a calling process.

FIG. 21 shows an overview flowchart for process 2100 for indexing information for a dynamic semantic model with multiple indices in accordance with at least one of the various embodiments. After a start block, at block 2102, MIDs and their corresponding values may be provided to an indexer, such as indexer 319. In at least one of the various embodiments, the MIDs provided to the indexer may comprise information such as that described for MID 1012 or MID 1026 in conjunction with FIG. 10.

At block 2104, in at least one of the various embodiments, the index type for the MID may be determined. In at least one of the various embodiments, the raw data graph element mapped to the MID may include feature information that represents the content-type of the underlying value of the concept instance that is represented by the MID. Accordingly, in at least one of the various embodiments, the indexer may be arranged to select an index from among a plurality of indices for indexing the MID. In at least one of the various embodiments, the index may be selected based on configuration information that includes a mapping of content-type values to indices. For example, in at least one of the various embodiments, MIDs representing text concept instances and/or values may be associated with an n-gram index. Likewise, in at least one of the various embodiments, MIDs representing temporal (date/time) concept instances and/or values may be associated with a temporal index. And, in at least one of the various embodiments, MIDs representing geo-spatial concept instances and/or values may be associated with a geo-spatial index.

Further, in at least one of the various embodiments, multiple indices may be optimized for the same content-type but each have different configurations. Also, in some cases, in at least one of the various embodiments, more than one index may be selected for a single MID. For example, there may be multiple time-based indices each having different time-range/time-bucket configurations. In some cases, for example, one time-based/temporal index may be configured to provide optimized indexing for days (24 hour periods) while another may be configured to provide optimized indexing of time values in terms of seconds.

Likewise, in at least one of the various embodiments, there may be multiple n-gram indices each optimized for one or more particular types of n-grams. For example, some indices may be arranged to be optimized to support different languages and/or character sets.
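For illustration, the selection at block 2104 might be sketched as a configuration lookup from content-type to one or more index names, with two temporal indices that bucket the same value at different granularities. The index names, the configuration layout, and the key formats below are assumptions made for this sketch.

```python
from datetime import datetime

# Hypothetical configuration: content-type -> names of the indices to use.
INDEX_CONFIG = {
    "text": ["ngram_en"],
    "temporal": ["time_by_day", "time_by_second"],
    "geo": ["geo_spatial"],
}


def select_indices(content_type):
    """Return every index configured for the given content-type."""
    if content_type not in INDEX_CONFIG:
        raise ValueError(f"no index configured for content-type {content_type!r}")
    return list(INDEX_CONFIG[content_type])


def temporal_key(ts, index_name):
    """Bucket a timestamp according to the granularity of the chosen temporal index."""
    if index_name == "time_by_day":
        return ts.strftime("%Y-%m-%d")
    if index_name == "time_by_second":
        return ts.strftime("%Y-%m-%dT%H:%M:%S")
    raise ValueError(f"{index_name!r} is not a temporal index")


sample = datetime(2015, 1, 21, 13, 5, 9)
keys = {name: temporal_key(sample, name) for name in select_indices("temporal")}
```

A single temporal MID is thus indexed twice in this sketch, once per bucket size, which matches the case where more than one index is selected for a single MID.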

At block 2106, in at least one of the various embodiments, the indexer may generate an index record that corresponds to the MID. In at least one of the various embodiments, index records, such as those displayed in FIGS. 11-13, may be generated for MIDs that are provided to the indexer. In at least one of the various embodiments, the actual format of the index record may vary depending on the implementation of the index. In at least one of the various embodiments, each index record may include a model graph for determining where the concept instance represented by the MID fits within the structure of the semantic model. Also, each index record may include the actual index keys, such as n-grams for n-gram values, time/date for temporal values, geo-spatial location information for geo-spatial values, or the like.

In at least one of the various embodiments, multiple index records may be generated for each MID depending on the value of the MID and the type of index. Accordingly, in at least one of the various embodiments, if a value of a concept instance represented by a MID includes multiple n-grams, multiple index records may be generated to correspond with each n-gram. For example, if a MID represents a movie title concept instance of “Nightmare in Georgia,” the indexer may generate index records for n-grams such as ‘nightmare’, ‘georgia’, ‘nightmare in georgia’, and so on.
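A minimal sketch of generating one index record per word n-gram is given below. The record layout, the path-like MID string, and the choice of word-level n-grams up to length three are assumptions for illustration; stop-word handling (for example, dropping ‘in’) is intentionally omitted.

```python
def word_ngrams(text, max_n=3):
    """Return all word n-grams of length 1..max_n, lowercased, in order of length."""
    words = text.lower().split()
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            grams.append(" ".join(words[i:i + n]))
    return grams


def index_records(mid, value):
    """Build one hypothetical index record per n-gram of the MID's value."""
    return [{"key": gram, "mid": mid} for gram in word_ngrams(value)]


records = index_records("movies/1/title", "Nightmare in Georgia")
```

Every record carries the same MID, so a match on any one n-gram leads back to the same concept instance.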

As discussed above, MIDs representing concept instances having temporal values may be indexed based on the time value. And, MIDs representing concept instances having geo-spatial values may be indexed based on the geo-spatial information.

Further, in at least one of the various embodiments, the indexer may extract the keys and values from the MIDs for storing in the index record. Likewise, in some embodiments, information for retrieving the underlying source data record may be determined from the MID and added to the index record. In at least one of the various embodiments, this information may be a URI, or other form of identifier that may be employed for locating and retrieving the original source data.

In at least one of the various embodiments, since a mapping engine may produce multiple MIDs from the same source data record, one or more generated index records may include location/retrieval information for the same source data record.

At block 2108, in at least one of the various embodiments, one or more extra data values may be generated based on the value and content-type of the concept instance that corresponds to the MID. As discussed above, extra data may be one or more additional columns of data that include additional data that may be related to the MID. In at least one of the various embodiments, some extra data may be common to index records for the different types of indices, such as age of record, and so on. Also, in at least one of the various embodiments, extra data columns may vary depending on the type of index. Further, although not shown in FIGS. 11-13 or discussed in detail otherwise, index records may include one or more columns for bookkeeping information, administration, access control, implementation details, or the like, for supporting the operation of an index.

Further, in at least one of the various embodiments, n-gram index records may include extra data for representing other n-gram values that may have various relationships, semantic or otherwise, to the n-grams and/or concept instance values of the MID. In at least one of the various embodiments, the extra data may include words from other languages that have the same or similar meanings.

At block 2110, in at least one of the various embodiments, the generated index record may be added to an index that may be selected based on the index type. In at least one of the various embodiments, as mentioned above, the selected index may be optimized for the content-type of the concept instance value of the MID. Accordingly, the selected index may index the generated index record using one or more well-known techniques for indexing the content-type of the concept instance associated with the MID.

Also, in at least one of the various embodiments, the indexer may generate one or more records for one or more indices. In at least one of the various embodiments, a forward index, such as forward index 1600 in FIG. 16, may be generated. Accordingly, in at least one of the various embodiments, such indices may be arranged to map various resources and/or concept instances to MIDs. In at least one of the various embodiments, other indices, such as join indices used for relating resources and/or concepts to other resources and/or concepts, may also be generated. For example, a join index may be arranged to associate movie resources with actor resources. Likewise, for example, another join index may be arranged to map an actor concept instance with the movie concept instance he or she is associated with, and so on.

In at least one of the various embodiments, configuration rules may be applied to determine the particular join indices and inverted indices that may be generated. In at least one of the various embodiments, the indexer may be arranged to recognize relationships between resources/MIDs that may benefit from a join index. Accordingly, in at least one of the various embodiments, the indexer may monitor the number of resources that have the same parent; if this number exceeds a defined threshold, the indexer may be arranged to generate a join index or an inverted index for mapping the parent resource to its children and vice-versa. In at least one of the various embodiments, the list of candidate join indices, if any, may be presented to a user in a graphical user interface. Accordingly, the user may be enabled to accept or decline the join indices. Next, control may be returned to a calling process.
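The threshold rule described above might be sketched as follows, purely for illustration. The child-to-parent mapping, the resource identifiers, and the two-way dictionary returned for an accepted candidate are hypothetical.

```python
from collections import Counter


def candidate_join_indices(parent_of, threshold):
    """Return parents whose number of children exceeds the defined threshold."""
    counts = Counter(parent_of.values())
    return sorted(parent for parent, n in counts.items() if n > threshold)


def build_join_index(parent_of, parent):
    """Map a parent resource to its children and each child back to the parent."""
    children = sorted(child for child, p in parent_of.items() if p == parent)
    return {
        "parent_to_children": {parent: children},
        "child_to_parent": {child: parent for child in children},
    }


# Hypothetical resources: three actors share one movie as parent, one actor has another.
parent_of = {
    "actor/1": "movie/1",
    "actor/2": "movie/1",
    "actor/3": "movie/1",
    "actor/4": "movie/2",
}
candidates = candidate_join_indices(parent_of, threshold=2)
join_index = build_join_index(parent_of, candidates[0])
```

In an interactive arrangement, `candidates` would be the list presented to the user, and `build_join_index` would run only for candidates the user accepts.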

FIG. 22 shows an overview for process 2200 for mapping raw data graph elements to a concept graph in accordance with at least one of the various embodiments. After a start block, at block 2202, a raw data graph may be generated. As discussed above, the raw data graph may be generated by an ingestion engine. In at least one of the various embodiments, the raw data graph may represent the structure of the source data. Also, in at least one of the various embodiments, the raw data graph elements may include various annotations generated by the one or more classifiers that may have processed the source data and/or payload during ingestion.

At block 2204, in at least one of the various embodiments, elements in the concept graph may be traversed by a mapping engine. In at least one of the various embodiments, the concept graph may have been determined and/or selected prior to the initiation of this mapping process. Accordingly, the concept graph may include one or more concept nodes and concept properties that have already been defined. However, in at least one of the various embodiments, the mapping engine must perform the actions to map some or all of the raw data graph elements to some or all of the elements in the concept graph.

At block 2206, in at least one of the various embodiments, one or more raw data elements from the raw data graph may be determined to map to the concept graph element. The mapping engine may be arranged to include one or more rules for identifying raw data elements that should be automatically mapped to the concept graph element. In some embodiments, the concept graph element may be associated with one or more rules and/or conditions that may be applied or tested against elements of the raw data graph. Accordingly, in some embodiments, if a raw data element meets enough of the rules/conditions, it may be automatically mapped to the concept graph element.

At block 2208, in at least one of the various embodiments, one or more raw data elements from the raw data graph may be determined to be candidates for mapping to the concept graph element. In at least one of the various embodiments, the mapping engine may be arranged to include one or more rules for identifying raw data elements that should be identified as candidates for mapping to the concept graph element. In some embodiments, the concept graph element may be associated with one or more rules and/or conditions that may be applied or tested against elements of the raw data graph. Accordingly, in some embodiments, if a raw data element meets enough of the rules/conditions, it may be determined to be a candidate for mapping to the concept graph element.

In at least one of the various embodiments, the list of candidate raw data graph elements, if any, may be presented to a user in a graphical user interface. Accordingly, the user may be enabled to accept or decline the raw data elements that are suggested for mapping.
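One way to picture the distinction between automatic mapping (block 2206) and candidate mapping (block 2208) is a rule count compared against two thresholds. The sketch below is an assumption for illustration only: the thresholds auto_min and candidate_min, the rule functions, and the element dictionaries are hypothetical, and the embodiments do not specify how "enough" rules is quantified.

```python
import re

def classify_elements(raw_elements, rules, auto_min, candidate_min):
    """Split raw data elements into auto-mapped names and candidate names.

    An element is auto-mapped when it satisfies at least auto_min rules and is
    a candidate when it satisfies at least candidate_min (but fewer than auto_min).
    """
    auto, candidates = [], []
    for elem in raw_elements:
        score = sum(1 for rule in rules if rule(elem))
        if score >= auto_min:
            auto.append(elem["name"])
        elif score >= candidate_min:
            candidates.append(elem["name"])
    return auto, candidates

# Hypothetical rules attached to a "telephone" concept graph element.
telephone_rules = [
    lambda e: "phone" in e["name"],
    lambda e: "telephone" in e["annotations"],
    lambda e: re.match(r"^\d{3}-\d{4}$", e["sample"]) is not None,
]

raw_elements = [
    {"name": "phone_number", "annotations": ["telephone"], "sample": "555-0100"},
    {"name": "contact", "annotations": [], "sample": "555-0199"},
    {"name": "street", "annotations": ["address"], "sample": "12 Main St"},
]

auto, candidates = classify_elements(raw_elements, telephone_rules, 2, 1)
```

Here the candidates list corresponds to the elements that would be presented to the user to accept or decline.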

At block 2210, in at least one of the various embodiments, process 2200 may enable a user to manually identify raw data elements for mapping to the concept graph element. Thus, in at least one of the various embodiments, a user may employ a user interface to select one or more raw data graph elements for mapping to the concept graph elements. In some embodiments, the concept graph element may include one or more constraints that may limit how elements may be mapped. For example, in at least one of the various embodiments, a concept graph may have constraints defined to prevent geographic address fields from being mapped to a telephone field.

At block 2212, in at least one of the various embodiments, the determined and/or selected raw data graph elements may be mapped to the concept graph element. In at least one of the various embodiments, a mapping node may be generated and stored in a system graph. The mapping node may include properties that define how the fields in the raw data elements are mapped to the properties in the concept graph elements.
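A mapping node that records field-to-property assignments, combined with a constraint check such as the address/telephone example above, might be sketched as follows. The MappingNode class, the map_field helper, and the allowed_types table are hypothetical names introduced for illustration and are not defined by the embodiments.

```python
from dataclasses import dataclass, field

@dataclass
class MappingNode:
    """Hypothetical mapping node: raw data fields mapped to concept properties."""
    concept_element: str
    field_to_property: dict = field(default_factory=dict)

def map_field(node, raw_field, raw_type, concept_property, allowed_types):
    """Record a mapping unless a constraint on the concept property forbids it.

    allowed_types maps a concept property to the set of raw field types it may
    accept; a property absent from the table is treated as unconstrained.
    """
    allowed = allowed_types.get(concept_property)
    if allowed is not None and raw_type not in allowed:
        raise ValueError(
            f"{raw_type!r} field {raw_field!r} may not map to {concept_property!r}"
        )
    node.field_to_property[raw_field] = concept_property
    return node

# Assumed constraint: only telephone-typed fields may map to 'telephone'.
constraints = {"telephone": {"telephone"}}
node = MappingNode("Person")
map_field(node, "phone_number", "telephone", "telephone", constraints)
```

Attempting to map a field typed as a geographic address to the telephone property raises an error in this sketch, which is one possible way to surface a violated constraint to the user.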

At decision block 2214, in at least one of the various embodiments, if there are more concept graph elements that need to be mapped to the raw data graph elements, control may loop back to block 2204; otherwise, control may be returned to a calling process.

FIG. 23 shows an overview flowchart for process 2300 for responding to queries for information from a dynamic semantic model with multiple indices in accordance with at least one of the various embodiments. After a start block, at block 2302, a query may be provided to a knowledge manager, such as, knowledge manager application 321. A user may provide the query using one or more interfaces, such as, command-line interfaces, GUI interfaces, web interfaces (e.g., REST APIs), or the like.

In at least one of the various embodiments, the query may be comprised of one or more well-known query languages, such as, SQL, Contextual Query Language (CQL), XQuery, SPARQL Protocol and RDF Query Language (SPARQL), custom query languages, or the like. Also, the query may be comprised of search terms, such as those provided to a search engine, rather than a formal query language.

At block 2304, in at least one of the various embodiments, the knowledge manager may determine the content-types for one or more of the search terms included in the query. In at least one of the various embodiments, the query contents may explicitly call out or define the content-type for a query. In other embodiments, the knowledge manager may determine the content-type of query terms based on their values. Accordingly, the knowledge manager may be arranged to employ one or more techniques, such as pattern matching, for determining the content-type of query terms included in the query.

In at least one of the various embodiments, if the knowledge manager is unable to determine a content-type for a query term, it may treat the content-type as a default value, such as ‘text/plain’, or the like. In at least one of the various embodiments, the default content-type may be set using configuration information.

For example, in at least one of the various embodiments, a query string of ‘smith 1998’ that is provided may result in the term ‘smith’ being characterized as an n-gram type, such as, ‘text/plain’ and ‘1998’ being characterized as a temporal data type.
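The ‘smith 1998’ example may be sketched with simple pattern matching and a configurable default. The pattern table and the content-type labels other than ‘text/plain’ (here 'temporal/year' and 'geo/point') are assumptions chosen for illustration; the embodiments do not name specific labels for temporal or geo-spatial terms.

```python
import re

# Hypothetical (pattern, content-type) pairs, tested in order.
PATTERNS = [
    (re.compile(r"^(1[0-9]{3}|20[0-9]{2})$"), "temporal/year"),
    (re.compile(r"^-?\d{1,3}\.\d+,-?\d{1,3}\.\d+$"), "geo/point"),
]

def content_type(term, default="text/plain"):
    """Return the first matching content-type, or the default if none match."""
    for pattern, ctype in PATTERNS:
        if pattern.match(term):
            return ctype
    return default

def classify_query(query, default="text/plain"):
    """Split a whitespace-separated query and pair each term with a content-type."""
    return [(term, content_type(term, default)) for term in query.split()]

typed_terms = classify_query("smith 1998")
# typed_terms == [("smith", "text/plain"), ("1998", "temporal/year")]
```

The default argument stands in for a default content-type set using configuration information.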

At block 2306, in at least one of the various embodiments, one or more indices may be selected based on the content-type of the query terms. In at least one of the various embodiments, if the query contents include multiple query terms of different content-types, multiple indices, at least one for each content-type, may be selected.

At block 2308, in at least one of the various embodiments, the query terms may be used to generate one or more result sets from the selected indices. Each query term may be provided to at least one of the selected indices. Accordingly, results for each query term may be produced from the indices.
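Blocks 2306 and 2308 may be pictured as routing each typed query term to the index registered for its content-type. In the sketch below, each index is modeled as a plain dictionary from term to a set of MIDs. The dictionary layout, the content-type labels, and the MID strings are hypothetical stand-ins, since actual indices (e.g., temporal or geo-spatial) would use structures suited to their data.

```python
def run_query(typed_terms, indices):
    """Look up each (term, content-type) pair in the index for that content-type.

    Returns {term: sorted list of result MIDs}; terms whose content-type has
    no registered index are skipped.
    """
    results = {}
    for term, ctype in typed_terms:
        index = indices.get(ctype)
        if index is None:
            continue
        results[term] = sorted(index.get(term.lower(), ()))
    return results

# Hypothetical indices keyed by content-type.
indices = {
    "text/plain": {"smith": {"mid:person/1", "mid:actor/7"}},
    "temporal/year": {"1998": {"mid:movie/3"}},
}

result_sets = run_query(
    [("smith", "text/plain"), ("1998", "temporal/year")], indices
)
```

How the per-term result sets are combined (for example, by intersection or union) is left open here, as the description above does not fix a particular combination rule.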

In at least one of the various embodiments, queries may include terms for grouping, clustering, or segmenting results. Also, in at least one of the various embodiments, groups, segments, and/or clusters may be defined to become concepts that may be added to the semantic model. For example, in at least one of the various embodiments, if a query includes terms for segmenting a population (e.g., actors) by age, such as, child, youth, young adult, adult, and so on, the concept ‘age group’ may be introduced to the semantic model.

At block 2310, in at least one of the various embodiments, the result sets may be provided to the user and/or other application that provided the query. In at least one of the various embodiments, the result sets may be in the form of a text file, XML file, or the like. In some embodiments, the result set may be provided in the form of a graphical report. In at least one of the various embodiments, the graphical reports may be interactive, enabling users to interactively select and/or view relationships between the entities included in the result set.

At block 2312, in at least one of the various embodiments, additional queries may be generated and/or the model may be updated based on user interactions. In at least one of the various embodiments, a user may generate additional queries from the results of a previous query. In at least one of the various embodiments, results from a query may display one or more concepts that are related to the concept identified by the previous query. Accordingly, in at least one of the various embodiments, a user may query for the related concepts. In at least one of the various embodiments, a user interface may display an interactive list of the results, enabling the user to execute additional queries by selecting items in the list.

Further, in at least one of the various embodiments, queries may produce initial result lists that include different concepts in the same list. For example, a search for “John Smith” may match a Movie Actor concept and a Person concept. Thus, in this example, if the user further queries (by selecting) the matching Movie Actor concept, additional related results may be generated from the movie database information. This may include a list of movies “John Smith” was involved in, what his roles were, and so on. Likewise, in this example, if a user selected the Person concept corresponding to “John Smith”, an additional query may return personal information about “John Smith”, such as, email address, age, height, weight, and so on.

Further, in at least one of the various embodiments, the results of a query may also list source data records that include the query terms. Accordingly, a user may be enabled to retrieve the source data records corresponding to the query rather than being limited to the information included in the concept graph.

In at least one of the various embodiments, the indices may be updated based on query contents, result sets, or user feedback. In at least one of the various embodiments, the knowledge manager may be arranged to automatically highlight semantic information that may be associated with the entities/resources that may have been involved directly or indirectly in queries.

In at least one of the various embodiments, if a query includes grouping terms (e.g., group by, clustering, segmenting, or the like), the groups that were included in the result set may be added to the semantic model. In at least one of the various embodiments, the groups may be used to define new concepts that may be added to the semantic model. For example, if a query includes terms for segmenting a population (e.g., actors) by age, such as, child, youth, young adult, adult, and so on, the concept ‘age group’ may be introduced to the semantic model. In this example, the actor concept may be augmented by adding the ‘age group’ concept to the actors with a value of child, youth, young adult, adult, and so on, for each actor. Accordingly, in at least one of the various embodiments, MIDs for the concept instances discovered by the query may be generated and indexed similarly as the MIDs determined during ingestion.
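The ‘age group’ example may be sketched as augmenting each actor instance with a group value and deriving a MID for the new concept instance. The age boundaries, the dictionary layout of an actor, and the MID path format below are all assumptions made for illustration; the embodiments state only that such MIDs include at least a path in the concept graph.

```python
def age_group(age):
    """Assign an age segment; the cut-off ages are hypothetical."""
    if age < 13:
        return "child"
    if age < 18:
        return "youth"
    if age < 30:
        return "young adult"
    return "adult"

def add_age_group_concept(actors):
    """Augment each actor with an 'age group' value and return the new MIDs.

    Each MID is formed here by extending the actor's MID with a path segment
    for the new concept and its value.
    """
    mids = []
    for actor in actors:
        group = age_group(actor["age"])
        actor["age group"] = group
        mids.append(f"{actor['mid']}/age_group/{group.replace(' ', '_')}")
    return mids

# Hypothetical actor concept instances returned by a segmenting query.
actors = [
    {"mid": "mid:actor/7", "age": 41},
    {"mid": "mid:actor/9", "age": 22},
]
new_mids = add_age_group_concept(actors)
```

The returned MIDs could then be indexed in the same manner as MIDs produced during ingestion.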

In at least one of the various embodiments, the query may explicitly include command language to add groups, clusters, or segments, to the semantic model. Such command language may include defining a name for the concept associated with the group. In at least one of the various embodiments, the knowledge manager may automatically identify query results that may be added to the semantic model as concepts. For example, in some embodiments the knowledge manager may automatically generate concepts based on the results of repeated group-by queries.

In at least one of the various embodiments, the knowledge manager may recognize that one or more sub-sets of results may be related. Accordingly, the knowledge manager may generate concepts that capture the relationships. For example,

Further, in at least one of the various embodiments, indices may be updated to reflect user feedback. In at least one of the various embodiments, user feedback may include additional source data that may be ingested. Accordingly, such user feedback may result in additional MIDs being added to the indices.

In at least one of the various embodiments, queries (e.g., searches) may be saved by adding them to the semantic model. Accordingly, a search node may be generated and added to the graph database. In at least one of the various embodiments, the search node may include properties representing the result types that may be returned by executing the query. In some embodiments, these properties may be explicitly expressed in the query language of the search. In other cases, the properties may be determined based on the actual concept elements and/or raw data elements that are returned in the result set.

FIG. 24 shows an overview flowchart for process 2400 for ingesting source data for a dynamic semantic model in accordance with at least one of the various embodiments. After a start block, at block 2402, an ingestion engine may ingest the source data, producing an initial raw data graph.

At block 2404, in at least one of the various embodiments, a classifier may be determined from the set of registered classifiers.

In at least one of the various embodiments, classifiers may be arranged to discover and/or extract feature information from the source data and/or the payload itself. In some embodiments, one or more classifiers may be specifically designed to process particular types of source data. These classifiers may be looking for particular fields and/or patterns in the source data that may be identified as features.

In at least one of the various embodiments, classifiers may be arranged to perform an initial operation to determine if the payload includes information that may be relevant to them.

Accordingly, in some embodiments, classifiers may be arranged to test values in the payload meta-data, such as, record type, content-type, source, age/date, owner, or the like, to determine if the classifier may further process the data. In at least one of the various embodiments, a classifier that may be arranged to process a source record from a particular data source, such as a particular patient/clinical record database, may accept or decline an invitation to process the payload based on the values of one or more meta-data values. Likewise, in at least one of the various embodiments, a classifier may be designed to process older source records (e.g., that may be provided in an older format). Accordingly, such a classifier may be arranged to accept records that may be older than a defined date and decline records that may be newer than the defined date.
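The accept/decline test on payload meta-data may be sketched as a small predicate. The class name, the meta-data keys "source" and "date", and the cut-off value are hypothetical and serve only to illustrate a classifier that handles older records from one particular data source.

```python
from datetime import date

class LegacyRecordClassifier:
    """Hypothetical classifier that only processes older records from one source."""

    def __init__(self, source, cutoff):
        self.source = source  # data source this classifier understands
        self.cutoff = cutoff  # records dated before this are accepted

    def accepts(self, meta):
        """Return True if the payload meta-data indicates a record to process."""
        record_date = meta.get("date")
        return (
            meta.get("source") == self.source
            and record_date is not None
            and record_date < self.cutoff
        )

classifier = LegacyRecordClassifier(source="clinic_db", cutoff=date(2005, 1, 1))
accepted = classifier.accepts({"source": "clinic_db", "date": date(1999, 6, 1)})
declined = classifier.accepts({"source": "clinic_db", "date": date(2010, 6, 1)})
```

In this sketch a record lacking a date is declined, which is one possible policy; an implementation might instead defer to other meta-data values.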

At block 2406, in at least one of the various embodiments, the classifier may process the raw data graph information and the source data. In at least one of the various embodiments, classifiers that discover and extract one or more features from the source data may add them to the payload.

At block 2408, in at least one of the various embodiments, the feature information that was determined and/or discovered by the classifier may be added to fields and/or elements of the raw data graph as annotations to provide more information about the raw graph element.

At decision block 2410, in at least one of the various embodiments, if there are more classifiers available to process the payload, control may loop back to block 2404; otherwise, control may be returned to a calling process.
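The loop over blocks 2404 through 2410 may be sketched as follows. The representation of a classifier as a dictionary holding "accepts" and "extract" callables, and of the raw data graph as a dictionary of field nodes, are assumptions made for illustration only.

```python
def annotate_raw_graph(classifiers, payload, raw_graph):
    """Run each willing classifier and attach its features as annotations.

    classifiers: list of {"accepts": fn(meta) -> bool, "extract": fn(data) -> dict}
    payload:     {"meta": dict, "data": dict}; extracted features are added to it
    raw_graph:   {field name: node dict}; annotations are appended per field
    """
    for classifier in classifiers:
        # A classifier may decline the payload based on its meta-data.
        if not classifier["accepts"](payload["meta"]):
            continue
        features = classifier["extract"](payload["data"])
        # Discovered features are added to the payload.
        payload.setdefault("features", {}).update(features)
        # Features are also recorded as annotations on raw data graph elements.
        for field_name, annotation in features.items():
            if field_name in raw_graph:
                raw_graph[field_name].setdefault("annotations", []).append(annotation)
    return raw_graph

# Hypothetical classifier that flags ISO-style date strings as temporal.
date_classifier = {
    "accepts": lambda meta: meta.get("content-type") == "text/csv",
    "extract": lambda data: {
        k: "temporal" for k, v in data.items() if len(v) == 10 and v[4] == "-"
    },
}
```

The annotated raw data graph produced this way is what the mapping process described for FIG. 22 would later consult.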

It will be understood that each block of the flowchart illustration, and combinations of blocks in the flowchart illustration, can be implemented by computer program instructions. These program instructions may be provided to a processor to produce a machine, such that the instructions, which execute on the processor, create means for implementing the actions specified in the flowchart block or blocks. The computer program instructions may be executed by a processor to cause a series of operational steps to be performed by the processor to produce a computer-implemented process such that the instructions, which execute on the processor, provide steps for implementing the actions specified in the flowchart block or blocks. The computer program instructions may also cause at least some of the operational steps shown in the blocks of the flowchart to be performed in parallel. These program instructions may be stored on some type of machine readable storage media, such as processor readable non-transitive storage media, or the like. Moreover, some of the steps may also be performed across more than one processor, such as might arise in a multi-processor computer system. In addition, one or more blocks or combinations of blocks in the flowchart illustration may also be performed concurrently with other blocks or combinations of blocks, or even in a different sequence than illustrated without departing from the scope or spirit of the invention.

Accordingly, blocks of the flowchart illustration support combinations of means for performing the specified actions, combinations of steps for performing the specified actions and program instruction means for performing the specified actions. It will also be understood that each block of the flowchart illustration, and combinations of blocks in the flowchart illustration, can be implemented by special purpose hardware-based systems, which perform the specified actions or steps, or combinations of special purpose hardware and computer instructions. The foregoing example should not be construed as limiting and/or exhaustive, but rather, an illustrative use case to show an implementation of at least one of the various embodiments of the invention.

What is claimed as new and desired to be protected by Letters Patent of the United States is:
1. A method for managing data over a network by using one or more processors, included with one or more network computers, to perform actions, comprising: providing one or more model-identifiers (MIDs) that correspond to one or more concept instances, wherein a concept instance is based on source data and a raw data graph that is mapped to a concept graph; indexing values from the source data that correspond to the one or more MIDs with one or more different types of indices that are selected from a plurality of different types of indices based on a content-type of the source data, wherein the different types of indices include one or more of temporal indices or geo-spatial indices; and in response to a query, providing a result that includes one or more MIDs, wherein a content-type of one or more portions of the query is employed to select the one or more different types of indices used to generate the result.